Test-Time Compute
For most of the deep learning era, a model’s quality was fixed the moment training finished. Inference was cheap and constant: one forward pass per token. Test-time compute breaks that assumption. Instead of making the model bigger, you let it spend more compute at the moment of answering, generating, searching, and revising before it commits. This is the engine behind the 2024-2025 reasoning models, and it changes the cost structure of AI in ways that connect directly to LLM Energy Use.
The Core Trade
The foundational result came from Snell et al. (2024), who studied how to allocate inference compute optimally and found that test-time compute can substitute for parameters. On some problems, a smaller model thinking longer beats a much larger model answering immediately: in their setting, a model using extra test-time compute matched one roughly 14x larger. Training and inference compute are partially fungible: you can buy capability by training a bigger model, or by letting a smaller one deliberate.
This is why OpenAI’s o1 and o3, DeepSeek R1, and Gemini 2.5 are described as “System-2” models. They are not necessarily larger than their predecessors. They are trained to use a long internal chain of reasoning tokens before the visible answer, and their accuracy climbs as that budget grows.
How the Compute Gets Spent
Several methods turn extra inference compute into better answers. They split into two families:
- Parallel sampling. Generate many candidate answers, then pick one. Best-of-N uses a verifier or reward model to score candidates. Self-consistency takes a majority vote across reasoning paths. More samples, better odds of a correct one surfacing.
- Sequential deliberation. Let the model revise its own work. This includes self-correction, and search procedures like beam search or Monte Carlo Tree Search over reasoning steps, often guided by a process reward model that scores intermediate steps rather than just the final answer.
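The two parallel-sampling selection rules can be sketched in a few lines. This is a minimal illustration, not any lab's implementation: the sampled answers and the `toy_scores` table are made-up data, and the score lookup stands in for a real verifier or reward model.

```python
from collections import Counter
from typing import Callable, Sequence


def self_consistency(answers: Sequence[str]) -> str:
    """Majority vote: return the most common final answer across samples."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Ties are broken in favor of the answer seen first.
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N: return the candidate the verifier scores highest."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return max(answers, key=score)


# Hypothetical data: five sampled final answers to one problem.
samples = ["42", "41", "42", "17", "42"]

# Hypothetical verifier scores, standing in for a reward model.
toy_scores = {"42": 0.6, "41": 0.9, "17": 0.1}

print(self_consistency(samples))                   # 42
print(best_of_n(samples, lambda a: toy_scores[a])) # 41
```

The two rules can disagree, as they do here: majority vote trusts agreement among samples, while best-of-N trusts the verifier, so its quality is bounded by the verifier's.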
The compute-optimal strategy mixes these depending on difficulty. Easy problems do well with a little sequential revision; hard problems benefit from broad parallel search. Spending the budget the same way on every problem wastes it.
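One way to picture a difficulty-dependent budget is a function that splits a fixed number of generations between parallel chains and sequential revisions. The rule below is an illustrative assumption, not the procedure from Snell et al.: `split_budget`, the `difficulty` score in [0, 1], and the geometric interpolation are all hypothetical.

```python
def split_budget(total_generations: int, difficulty: float) -> tuple[int, int]:
    """Split a budget into (parallel_chains, revisions_per_chain).

    difficulty is a hypothetical estimate in [0, 1]:
    0 puts the whole budget into one chain that revises sequentially,
    1 spends it all on independent parallel samples.
    """
    if total_generations < 1:
        raise ValueError("budget must be at least one generation")
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    # Geometric interpolation between 1 chain and total_generations chains.
    chains = round(total_generations ** difficulty)
    chains = max(1, min(total_generations, chains))
    revisions = max(1, total_generations // chains)
    return chains, revisions


for d in (0.0, 0.5, 1.0):
    print(d, split_budget(16, d))
# 0.0 (1, 16)
# 0.5 (4, 4)
# 1.0 (16, 1)
```

In practice the difficulty estimate itself costs compute to obtain, for example from a few cheap pilot samples, and that overhead belongs in the budget.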
The Cost Curve Is Brutal
The catch is that returns diminish hard, and cost does not. OpenAI’s o3 on the ARC-AGI benchmark is the clearest public example. The high-compute configuration used 172x more compute than the low-compute one: 75.7% accuracy at about $26 per task using 6 samples, versus 87.5% at roughly $4,560 per task using 1,024 samples. A roughly 175x increase in cost bought about 12 percentage points of accuracy. The ARC Prize Foundation estimated the highest configuration at thousands of dollars per task, with a single problem consuming tens of millions of tokens as the search explores and backtracks through solution space.
So test-time compute is a dial, not a free lunch. Each turn of the dial costs more for less. Whether it is worth it depends entirely on the value of the marginal correct answer, which is why it makes sense for frontier math and code and not for summarizing an email.
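The arithmetic behind the o3 comparison is worth doing explicitly. The sketch below uses only the accuracy and per-task cost figures quoted in this section; the variable names are illustrative.

```python
# Figures quoted in this section: accuracy in percent, cost in USD per task.
low_acc, low_cost = 75.7, 26.0
high_acc, high_cost = 87.5, 4560.0

cost_ratio = high_cost / low_cost
gain_points = high_acc - low_acc
# Extra dollars per task for each additional percentage point of accuracy.
marginal_cost = (high_cost - low_cost) / gain_points

print(f"cost ratio: {cost_ratio:.0f}x")                            # cost ratio: 175x
print(f"accuracy gain: {gain_points:.1f} points")                  # accuracy gain: 11.8 points
print(f"marginal cost: ${marginal_cost:,.0f} per point per task")  # marginal cost: $384 per point per task
```

For comparison, the low-compute configuration averages about $0.34 per accuracy point ($26 / 75.7), so each of the last points costs on the order of a thousand times more than the average early one.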
Why It Matters for Energy and Economics
Reasoning tokens are the multiplier in the energy story. A test-time-heavy query can cost 10x to 1000x a simple one, because energy is paid per generated token and these models generate enormous hidden traces. That moves the lifetime footprint of a model decisively toward inference and complicates every per-query efficiency claim, since the same model can be cheap or staggeringly expensive depending on how hard you let it think. It also reshapes [private link]: the cost of a “query” is no longer a property of the model alone but of the compute budget attached to it.
And it feeds the Jevons Paradox loop. As reasoning gets cheaper per token, it gets attached to more tasks, especially autonomous agents that run long chains unattended, so total inference compute rises even as each token gets more efficient. This is the same dynamic that keeps the subsidized pricing underwater. Architecturally, sparse approaches like Mixture of Experts cut the cost per token, which partially offsets but does not cap the test-time growth.
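The per-token accounting in this section can be made concrete with a toy model. Everything numeric here is a placeholder assumption: `joules_per_token` is not a measurement, the token counts are invented, and the model ignores prompt processing, batching, and hardware differences.

```python
def query_energy_joules(visible_tokens: int, reasoning_tokens: int,
                        joules_per_token: float) -> float:
    """Energy for one query if cost scales linearly with generated tokens."""
    return (visible_tokens + reasoning_tokens) * joules_per_token


JOULES_PER_TOKEN = 1.0  # hypothetical placeholder, not a measured value

# Hypothetical queries: same 200-token visible answer, different hidden traces.
simple = query_energy_joules(200, 0, JOULES_PER_TOKEN)
reasoning = query_energy_joules(200, 19_800, JOULES_PER_TOKEN)

print(f"reasoning query costs {reasoning / simple:.0f}x the simple one")
# reasoning query costs 100x the simple one
```

Under this linear assumption the per-token constant cancels out of the ratio, which is why the multiplier depends on the reasoning budget rather than on the model alone.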
Try it
Watch accuracy climb with samples (1-2 hours, Python + any chat API). Take 20 hard math problems (e.g. from MATH or AIME). For each, sample the model k = 1, 4, 16, 64 times at temperature ~0.8 and take the majority answer (self-consistency). Plot accuracy vs k. You should see a rising, concave curve: real gains early, flattening fast. Multiply k by the per-call token cost to see your own version of the o3 cost curve. Then swap majority vote for “best-of-N scored by asking the model to grade its own answers” to feel the difference a verifier makes.
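A possible harness for this exercise is sketched below. It assumes you supply your own `sample_answer(question)` function that calls a chat API and extracts a final answer string; the `make_stub_sampler` helper here is a hypothetical random stand-in so the script runs offline, and the toy problems are placeholders for real MATH or AIME items.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple

Problem = Tuple[str, str]  # (question, reference answer)


def accuracy_at_k(problems: List[Problem],
                  sample_answer: Callable[[str], str], k: int) -> float:
    """Fraction of problems where the majority of k samples is correct."""
    correct = 0
    for question, reference in problems:
        votes = Counter(sample_answer(question) for _ in range(k))
        majority = votes.most_common(1)[0][0]
        correct += majority == reference
    return correct / len(problems)


def make_stub_sampler(answers: Dict[str, str], p_correct: float,
                      rng: random.Random) -> Callable[[str], str]:
    """Hypothetical stand-in for an API call: right with probability p_correct."""
    def sample(question: str) -> str:
        if rng.random() < p_correct:
            return answers[question]
        return f"wrong-{rng.randrange(3)}"  # one of three distinct wrong answers
    return sample


# Placeholder problems; replace with 20 real ones and their reference answers.
problems = [(f"q{i}", f"a{i}") for i in range(20)]
sampler = make_stub_sampler(dict(problems), p_correct=0.45, rng=random.Random(0))

for k in (1, 4, 16, 64):
    acc = accuracy_at_k(problems, sampler, k)
    # Relative cost is k calls; multiply by your own tokens-per-call and price.
    print(f"k={k:>2}  accuracy={acc:.2f}  relative_cost={k}x")
```

To run it against a real model, replace `sampler` with a function that sends the question to your API at temperature ~0.8 and parses the final answer. Answer normalization (for example "1/2" versus "0.5") matters a great deal for majority voting and is not handled here.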
See also
- LLM Energy Use — where reasoning tokens dominate the per-query footprint
- [private link] — test-time compute as a cost variable, not a constant
- Mixture of Experts — cutting cost per token from the architecture side
Sources
- A Survey of Test-Time Compute (arXiv 2501.02497) — methods taxonomy and the Snell compute-optimal result
- OpenAI o3 breakthrough on ARC-AGI (ARC Prize) — the accuracy-vs-cost numbers
- o3 may be costlier than estimated (TechCrunch) — per-task cost estimates