Feb 7, 2026 · 10 min read

The Real Flywheel

TL;DR: A flywheel that needs humans to spin it is a treadmill. A flywheel spun by coding agents is not.

Almost a year ago, right after my team (post-training) shipped several major GPT‑4o updates, I felt exhausted — excited, but oddly disoriented.

DAU looked great. The holdout set showed a high single-digit lift. In any normal product org, that’s champagne.

And we had, in a sense, found what search / recommendation / ads people have always treated as the holy grail: a large user base; a plausible story for turning behavior into training signal; and a crew of MLEs who could stack endless small wins until the metrics moved.

But it still wasn’t AGI. It was the old comfortable paradigm. The wins didn’t compound into a capability jump; the curve could flatten at any moment.

It felt like raising a digient in The Lifecycle of Software Objects — they keep getting better, but they don’t grow up.

That realization drained me. I started looking for a different kind of work.

Some Flywheels

In search / recs / ads, the canonical flywheel is: you ship, users show up, behavior becomes signal, the model improves, and you ship again.

At scale, the user base is not only the prize — it’s inertia. Data stops feeling like fuel and starts acting like stored momentum: you bank it, and you spend it to move faster.

Autonomous driving was the same story with different nouns: miles become edge cases; edge cases become autonomy; autonomy buys more miles.

The Real Flywheel

A flywheel that needs humans to spin it is a treadmill.

With humans in the loop, iteration speed is capped by meetings, reviews, coordination, and the tiny number of changes you can safely push at once.

A/B testing is the slowest version of this: you burn millions of interactions to buy one decision. Even “better” loops — CTR/CVR models, implicit feedback training — are still downstream of UI exposure and confounders. The signal is real, but it’s shaped by humans at every step.
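
To make “millions” concrete: the textbook rule of thumb for a two-proportion test at 80% power and \(\alpha = 0.05\) needs roughly \(16\,p(1-p)/\delta^2\) samples per arm. The rates below are made up but plausible; this is a back-of-envelope sketch, not anyone’s real experiment:

```python
# Back-of-envelope power calculation: samples per arm needed to detect
# a lift at ~80% power, alpha = 0.05, using the common n ≈ 16·p(1-p)/δ²
# rule of thumb for a two-proportion test. Illustrative numbers only.
def samples_per_arm(base_rate: float, relative_lift: float) -> int:
    delta = base_rate * relative_lift  # absolute effect size to detect
    return round(16 * base_rate * (1 - base_rate) / delta**2)

print(samples_per_arm(0.02, 0.10))  # 10% relative lift on a 2% rate: ~78,400 per arm
print(samples_per_arm(0.02, 0.01))  # 1% relative lift: ~7,840,000 per arm
```

A two-arm test for a 1% relative lift on a 2% conversion rate burns roughly 16 million interactions. One decision.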

The real question is stricter:

Can we turn ideas into clean, attributable evidence on demand — repeatedly — without a human ceremony layer each time?

If you’ve used Codex / Claude Code seriously, you can feel the first crack in the treadmill. Once enough engineers believe agents are real, the loop starts closing on itself.

Datapoints and Researchers

My mental model is blunt: if you have enough experimental datapoints, even an okay researcher will find the manifold.

With 10× more datapoints, an average PhD can look like Ilya.

With 10^5× more datapoints, a strong agent can start to look like a serious researcher.

Not because theory stops mattering, but because a lot of frontier work sits uncomfortably close to the empirical end of the spectrum: you learn by trying.

And a “datapoint” here is not a single training example. It’s an end-to-end belief update produced by a long rollout of humans, GPUs, and org structure — from data center capacity to an experiment proposal, data plumbing, training runs, babysitting failures, and writing the Slack post.

A clean ablation. A failure with a named failure mode and a concrete fix. A new eval slice that exposes a blind spot. An online signal you can trace back to a capability.

I don’t buy that idea supply is the bottleneck in frontier labs. GPU supply isn’t either — money can buy more GPUs (and hire brains).

The bottleneck is how fast you can convert an idea into a high‑signal datapoint — and then do it again.

When people say “AI improves itself,” the first thing that happens is probably not AI proposing one brilliant idea that reshapes the landscape. It’s a boring multiplier: more attempts per unit time, per GPU, per researcher, per engineer.

Genius shifts the distribution. Agent‑accelerated infrastructure increases the number of draws.
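
You can sanity-check this with a toy extreme-value simulation: the expected best of \(N\) draws from a standard normal grows roughly like \(\sqrt{2\ln N}\), so 10× more attempts buys about what a one-sigma shift in mean does. Illustrative numbers only; research output is obviously not i.i.d. Gaussian:

```python
# Toy check of "shift the distribution vs. increase the draws":
# the best of N draws from a normal grows slowly with N, but 10x the
# attempts roughly matches a one-sigma shift in mean. Illustrative only.
import random

def best_of(n: int, mu: float = 0.0, trials: int = 2000) -> float:
    """Average, over many trials, of the best of n draws from N(mu, 1)."""
    return sum(max(random.gauss(mu, 1) for _ in range(n)) for _ in range(trials)) / trials

print(f"genius, 10 attempts (mu=+1):   {best_of(10, mu=1.0):.2f}")  # ≈ 2.54
print(f"average, 100 attempts (mu=0):  {best_of(100):.2f}")         # ≈ 2.51
```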

RL Infra 2026

We are here in the AI‑2027 narrative. And people still hear “RL infra” and picture train/eval plumbing.

That’s table stakes.

What I mean is the system that manufactures effective datapoints with minimal human intervention. I think this is already starting to happen with tools like Codex. For the first time, the tool isn’t just a wrench. It can carry intent across steps.

Infrastructure used to be like a lathe: rigid, purpose-built, expensive to change. Increasingly it wants to look like an iPhone: a general platform you keep reshaping, by adding apps, as your needs change. And installing those apps keeps getting smoother.

If 2025 was about shipping model updates, then 2026 is about making the loop run by itself.

Some friends in GDM keep saying they’re doing “best‑of‑N.” I want to say: N now includes the agents.

Appendix: Control-Loop Framing

The flywheel is a control loop with gain

Think of the lab as a closed loop:

$$M \to A \to (\tau, \bar{q}) \to D_{\mathrm{eff}} \to M$$
  • \(M\): model capability (what the model can do)
  • \(A\): agent capability (how reliably it can execute multi-step work: implement, debug, run experiments, report)
  • \(\tau\): cycle time (idea → experiment → result → attribution → next step)
  • \(\bar{q}\): average signal quality (how clean/attributable/reproducible each datapoint is)
  • \(D_{\mathrm{eff}}\): effective datapoints produced per time window
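
A minimal toy version of this loop as a discrete-time recurrence. Every functional form and constant here is an assumption for illustration (linear transfer from \(M\) to \(A\), datapoints proportional to \(A\bar{q}/\tau\)), not a claim about how real labs behave:

```python
# Toy closed loop: M -> A -> (tau, q) -> D_eff -> M.
# All functional forms and constants are made up for illustration.
from dataclasses import dataclass

@dataclass
class LabLoop:
    M: float = 1.0    # model capability
    tau: float = 1.0  # cycle time: idea -> attributed result
    q: float = 0.8    # average signal quality per datapoint
    c: float = 0.5    # how well model gains become agent gains (dA/dM)
    k: float = 0.05   # capability bought per effective datapoint (dM/dD_eff)

    def step(self) -> float:
        A = self.c * self.M            # M -> A
        D_eff = A * self.q / self.tau  # (A, tau, q) -> D_eff
        self.M += self.k * D_eff       # D_eff -> next M
        return self.M

loop = LabLoop()
for _ in range(20):
    loop.step()
print(f"M after 20 turns: {loop.M:.2f}")  # (1 + c*k*q/tau)^20 ≈ 1.49
```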

What the “gain” product is saying

$$G \equiv \left(\frac{\partial M}{\partial D_{\mathrm{eff}}}\right) \cdot \left(\frac{\partial D_{\mathrm{eff}}}{\partial A}\right) \cdot \left(\frac{\partial A}{\partial M}\right)$$

This breaks the loop into three sensitivities:

  1. \(\frac{\partial A}{\partial M}\): When the model improves, how much does agent capability actually improve?

    If model gains don’t show up as better tool use, self-debugging, or multi-step reliability, this term stays small. The shift from “chatty” models to reasoning models / coding agents is mostly an increase in this term.

  2. \(\frac{\partial D_{\mathrm{eff}}}{\partial A}\): When agents improve, how much does your effective-datapoint factory improve?

    This is the “agents are inside the production line” term. If your infra org doesn’t adopt agents and still requires human ceremony for every run—manual setup, manual debugging, manual reporting—this term remains small. This is the main argument of this post.

  3. \(\frac{\partial M}{\partial D_{\mathrm{eff}}}\): When you produce more effective datapoints, how much does the next model improve?

    If datapoints are noisy, non-reproducible, or poorly attributed, scale buys little capability. In the short run (say, 2026), I don’t think agents materially move this term; it’s largely a function of researcher talent density, org structure, and leadership—i.e., how diverse ideas are generated and how evidence gets aggregated into decisions.

Multiply them and you get \(G\), the loop’s ability to self-accelerate.

  • Small \(G\) → incrementalism, flattening, treadmill vibes.
  • Large \(G\) → compounding: each turn of the loop makes the next turn faster/stronger.
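
In the toy recurrence above (everything linear, so this is a best case), the three sensitivities collapse to \(k\), \(\bar{q}/\tau\), and \(c\), and the loop is a pure exponential in their product:

$$G = k \cdot \frac{\bar{q}}{\tau} \cdot c, \qquad M_{t+1} = (1 + G)\,M_t$$

Halving \(\tau\) or doubling \(\bar{q}\) doubles \(G\), and those are exactly the terms agents attack.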

Comments

Thoughts, disagreements, and edge cases welcome.
