Jon Moshier / Notes / Test-Time Compute draft
Test-Time Compute

The shift from spending compute at training to spending it at inference: how reasoning models think longer to answer better, the methods behind it, and the steep cost curve that follows.

For most of the deep learning era, a model’s quality was fixed the moment training finished. Inference was cheap and constant: one forward pass per token. Test-time compute breaks that assumption. Instead of making the model bigger, you let it spend more compute at the moment of answering, generating, searching, and revising before it commits. This is the engine behind the 2024-2025 reasoning models, and it changes the cost structure of AI in ways that connect directly to LLM Energy Use.

The Core Trade

The foundational result came from Snell et al. (2024), who studied how to allocate inference compute optimally and found that test-time compute can substitute for parameters. On some problems, a smaller model thinking longer beats a much larger model answering immediately: in their setting, a model using extra test-time compute matched one roughly 14x larger. Training and inference compute are partially fungible. You can buy capability by training a bigger model, or by letting a smaller one deliberate.
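A rough way to see the trade, as a back-of-the-envelope sketch rather than the paper's own method: a dense transformer's forward pass costs about 2 FLOPs per parameter per generated token, so at a fixed inference budget a model 14x smaller can generate about 14x as many tokens. The parameter counts and token budget below are hypothetical, and the approximation ignores the cost of attending over a long context.

```python
# Back-of-the-envelope sketch: inference FLOPs ~ 2 * parameters * generated tokens.
# This is the usual dense-transformer approximation; the model sizes and token
# counts are hypothetical and are not taken from Snell et al.

def inference_flops(params: float, tokens: int) -> float:
    """Approximate FLOPs to generate `tokens` tokens with a `params`-parameter model."""
    return 2.0 * params * tokens


def matched_token_budget(large_params: float, small_params: float, large_tokens: int) -> int:
    """Tokens the small model can generate for the same FLOPs as the large one."""
    budget = inference_flops(large_params, large_tokens)
    return int(budget // (2.0 * small_params))


large_params = 14e9   # hypothetical large model
small_params = 1e9    # hypothetical model 14x smaller
large_tokens = 500    # hypothetical direct-answer length for the large model

small_tokens = matched_token_budget(large_params, small_params, large_tokens)
print(f"The small model can spend {small_tokens} tokens for the same FLOPs")
```

Whether those extra tokens actually close the quality gap is the empirical question the paper studies; the arithmetic only shows that the budgets are comparable.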

This is why OpenAI’s o1 and o3, DeepSeek R1, and Gemini 2.5 are described as “System-2” models. They are not necessarily larger than their predecessors. They are trained to use a long internal chain of reasoning tokens before the visible answer, and their accuracy climbs as that budget grows.

How the Compute Gets Spent

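Several methods turn extra inference compute into better answers. They split into two families: sequential methods, where the model revises its own answer over a longer chain of reasoning, and parallel methods, where many candidate answers are sampled and one is chosen by majority vote or by a verifier.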

The compute-optimal strategy mixes these depending on difficulty. Easy problems do well with a little sequential revision; hard problems benefit from broad parallel search. Spending the budget the same way on every problem wastes it.
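The two families and the difficulty-based split can be sketched as follows. This is a toy illustration, not the allocation procedure from any paper: the `sample` and `revise` callables stand in for model calls, and the 0.5 difficulty threshold in `allocate` is an arbitrary assumption.

```python
from collections import Counter
from typing import Callable, Dict


def parallel_majority_vote(sample: Callable[[], str], k: int) -> str:
    """Parallel family: draw k independent answers and return the most common one."""
    answers = [sample() for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]


def sequential_revision(draft: str, revise: Callable[[str], str], steps: int) -> str:
    """Sequential family: feed each answer back in and let the model revise it."""
    answer = draft
    for _ in range(steps):
        answer = revise(answer)
    return answer


def allocate(difficulty: float, budget: int) -> Dict[str, object]:
    """Toy rule: easy problems get sequential revision, hard ones get parallel search.

    `difficulty` is assumed to be an estimate in [0, 1]; the 0.5 cutoff is hypothetical.
    """
    if difficulty < 0.5:
        return {"strategy": "sequential", "steps": budget}
    return {"strategy": "parallel", "k": budget}


# Stub model calls so the sketch runs without an API.
canned = iter(["42", "41", "42", "7", "42"])
print(parallel_majority_vote(lambda: next(canned), k=5))
print(sequential_revision("draft", lambda a: a + " (revised)", steps=2))
print(allocate(difficulty=0.2, budget=4))
print(allocate(difficulty=0.9, budget=64))
```

In practice the difficulty estimate is itself a model output, so a real allocator pays a small cost before spending the main budget.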

The Cost Curve Is Brutal

The catch is that returns diminish hard, and cost does not. OpenAI’s o3 on the ARC-AGI benchmark is the clearest public example. The high-compute configuration used 172x more compute than the low one. The numbers: 75.7% accuracy at about $26 per task using 6 samples, versus 87.5% at roughly $4,560 per task using 1,024 samples. A roughly 175x increase in cost bought about 12 percentage points. The ARC Prize Foundation estimated the highest configuration at thousands of dollars per task, with a single problem consuming tens of millions of tokens as the search explores and backtracks through solution space.
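The arithmetic behind those figures is easy to reproduce. The sketch below uses only the accuracy, cost, and sample numbers quoted above; the variable names are illustrative.

```python
# Reproduce the o3 / ARC-AGI cost arithmetic from the figures quoted in the text.
low = {"accuracy": 75.7, "cost_per_task": 26.0, "samples": 6}
high = {"accuracy": 87.5, "cost_per_task": 4560.0, "samples": 1024}

cost_ratio = high["cost_per_task"] / low["cost_per_task"]
sample_ratio = high["samples"] / low["samples"]
gain_points = high["accuracy"] - low["accuracy"]
# Extra dollars per task paid for each additional percentage point of accuracy.
cost_per_point = (high["cost_per_task"] - low["cost_per_task"]) / gain_points

print(f"cost ratio:     {cost_ratio:.0f}x")
print(f"sample ratio:   {sample_ratio:.0f}x")
print(f"accuracy gain:  {gain_points:.1f} points")
print(f"cost per point: ${cost_per_point:.0f} per task")
```

The sample ratio (1,024 / 6) lands close to the reported 172x compute multiple, which suggests that cost here scales almost linearly with the number of samples.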

So test-time compute is a dial, not a free lunch. Each turn of the dial costs more for less. Whether it is worth it depends entirely on the value of the marginal correct answer, which is why it makes sense for frontier math and code and not for summarizing an email.
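One way to make "worth it" concrete is a break-even check: the extra spend pays off only when the value of a correct answer times the accuracy gained exceeds the extra cost. This is a minimal sketch under a simple expected-value assumption; the function names and the dollar values assigned to each task are hypothetical, while the cost and accuracy deltas come from the o3 figures above.

```python
def break_even_value(extra_cost: float, accuracy_gain: float) -> float:
    """Value a correct answer must have for the extra compute to pay for itself.

    `accuracy_gain` is a fraction (0.118 means 11.8 percentage points).
    """
    return extra_cost / accuracy_gain


def worth_extra_compute(value_per_correct: float, extra_cost: float, accuracy_gain: float) -> bool:
    """True when the expected value of the added accuracy exceeds its cost."""
    return value_per_correct * accuracy_gain > extra_cost


extra_cost = 4560.0 - 26.0          # dollars per task, from the o3 figures
accuracy_gain = (87.5 - 75.7) / 100  # 11.8 percentage points as a fraction

print(f"break-even value per correct answer: ${break_even_value(extra_cost, accuracy_gain):,.0f}")
# Hypothetical task values: a research-grade proof vs. an email summary.
print(worth_extra_compute(1_000_000.0, extra_cost, accuracy_gain))
print(worth_extra_compute(1.0, extra_cost, accuracy_gain))
```

The check treats each task as independent and ignores latency, so it is a lower bound on how valuable the answer has to be.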

Why It Matters for Energy and Economics

Reasoning tokens are the multiplier in the energy story. A test-time-heavy query can cost 10x to 1000x a simple one, because energy is paid per generated token and these models generate enormous hidden traces. That moves the lifetime footprint of a model decisively toward inference and complicates every per-query efficiency claim, since the same model can be cheap or staggeringly expensive depending on how hard you let it think. It also reshapes [private link]: the cost of a “query” is no longer a property of the model alone but of the compute budget attached to it.
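The 10x to 1000x range falls out of token counts alone. If energy per generated token is treated as constant (a simplifying assumption), it cancels out of the ratio, and the multiplier is just total tokens over answer tokens. The token counts and the joules-per-token figure below are hypothetical placeholders, not measurements.

```python
def query_energy(answer_tokens: int, reasoning_tokens: int, joules_per_token: float) -> float:
    """Energy for one query, assuming a constant cost per generated token."""
    return (answer_tokens + reasoning_tokens) * joules_per_token


def energy_multiplier(answer_tokens: int, reasoning_tokens: int) -> float:
    """Energy relative to the same answer with no hidden reasoning trace."""
    return (answer_tokens + reasoning_tokens) / answer_tokens


answer_tokens = 300  # hypothetical visible answer length
for trace in (0, 2_700, 29_700, 299_700):  # hypothetical hidden trace lengths
    print(f"trace={trace:>7} tokens  multiplier={energy_multiplier(answer_tokens, trace):.0f}x")
```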

And it feeds the Jevons Paradox loop. As reasoning gets cheaper per token, it gets attached to more tasks, especially autonomous agents that run long chains unattended, so total inference compute rises even as each token gets more efficient. This is the same dynamic that keeps the subsidized pricing underwater. Architecturally, sparse approaches like Mixture of Experts cut the cost per token, which partially offsets but does not cap the test-time growth.

Try it

Watch accuracy climb with samples (1-2 hours, Python + any chat API). Take 20 hard math problems (e.g. from MATH or AIME). For each, sample the model k = 1, 4, 16, 64 times at temperature ~0.8 and take the majority answer (self-consistency). Plot accuracy vs k. You should see a rising, concave curve: real gains early, flattening fast. Multiply k by the per-call token cost to see your own version of the o3 cost curve. Then swap majority vote for “best-of-N scored by asking the model to grade its own answers” to feel the difference a verifier makes.
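A minimal harness for the experiment might look like the sketch below. It is a starting point under stated assumptions, not a finished benchmark: `make_stub_sampler` is a hypothetical stand-in for a real chat API call at temperature ~0.8, and it gives every problem the same chance of a correct sample, so its curve will not flatten the way real results do. Replace the stub with your own API call and real problems.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple

Problem = Tuple[str, str]  # (question, reference answer)


def majority_answer(answers: List[str]) -> str:
    """Self-consistency: return the most frequent answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def accuracy_at_k(problems: List[Problem], sample: Callable[[str], str], k: int) -> float:
    """Fraction of problems whose majority answer over k samples matches the reference."""
    correct = 0
    for question, reference in problems:
        answers = [sample(question) for _ in range(k)]
        correct += majority_answer(answers) == reference
    return correct / len(problems)


def make_stub_sampler(p_correct: float, seed: int = 0) -> Callable[[str], str]:
    """Hypothetical stand-in for a model call; swap in a real chat API request here."""
    rng = random.Random(seed)

    def sample(question: str) -> str:
        if rng.random() < p_correct:
            return "right"
        # Wrong answers are spread over several values, as they tend to be in practice.
        return f"wrong-{rng.randrange(5)}"

    return sample


# Placeholder problem set; use 20 real MATH or AIME problems with known answers.
problems = [(f"problem {i}", "right") for i in range(20)]
sample = make_stub_sampler(p_correct=0.4)

for k in (1, 4, 16, 64):
    acc = accuracy_at_k(problems, sample, k)
    print(f"k={k:>2}  accuracy={acc:.2f}  relative cost={k}x")
```

Extracting a comparable final answer from free-form model output (for example, the last boxed number) is the part that needs the most care, since majority voting only works when equivalent answers compare equal.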
