LLM Energy Use
The energy a language model consumes splits into two phases with very different shapes: a one-time training burn, and a per-query inference cost paid forever. The headline numbers people cite are usually training figures, but inference is where the lifetime energy goes, and it is the part almost no one measures honestly.
Training Is a One-Time Spike
Training a frontier model is a discrete, enormous event. GPT-4 was, by leaked estimates, trained on roughly 25,000 Nvidia A100 GPUs for 90 to 100 days, or around 57 million GPU-hours. That works out to somewhere between 51,772 and 62,318 MWh of electricity and an estimated 15,000+ tons of CO2e, comparable to the annual emissions of about 938 average Americans.
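The arithmetic behind an estimate like this is short enough to sketch. The inputs below are assumptions for illustration, not confirmed values: 8 GPUs per server, 6.5 kW per server (roughly the rated maximum of an 8-GPU A100 system), and a power usage effectiveness of 1.18. With these inputs the 90-day case comes out near 51,800 MWh, close to the low end of the range quoted above.

```python
def training_energy_mwh(gpus, days, gpus_per_server=8, server_kw=6.5, pue=1.18):
    """Facility-level training energy in MWh.

    gpus_per_server, server_kw and pue are illustrative assumptions,
    not measured values for any real training run.
    """
    servers = gpus / gpus_per_server
    hours = days * 24
    it_energy_kwh = servers * server_kw * hours  # IT load only, servers at full power
    return it_energy_kwh * pue / 1000            # add facility overhead, kWh -> MWh


# GPU-hours at the midpoint of the 90 to 100 day range
gpu_hours = 25_000 * 24 * 95
low = training_energy_mwh(25_000, 90)

print(f"{gpu_hours / 1e6:.0f} million GPU-hours")
print(f"{low:,.0f} MWh for a 90-day run")
```

Every argument is a lever: a different server power, utilization, or PUE moves the result by the same proportion, which is why the output deserves no more than one or two significant figures.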
The carbon figure is far softer than the energy figure. The same training run can vary by a factor of 13 in emissions depending on which Azure region it runs in, because grid carbon intensity differs that much between locations. Energy is physics. Carbon is an accounting choice about where and when you draw power.
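A minimal sketch of why the carbon number moves while the energy number stays put. The two grid intensities are hypothetical, picked only so that they differ by the factor of 13 mentioned above; they are not real figures for any Azure region.

```python
def emissions_tonnes(energy_mwh, grid_kg_per_kwh):
    """Emissions in tonnes CO2e.

    MWh * (kg/kWh) gives tonnes directly: the two factors of 1000 cancel.
    """
    return energy_mwh * grid_kg_per_kwh


ENERGY_MWH = 51_772  # low end of the training estimate quoted above

# Hypothetical grid carbon intensities in kg CO2e per kWh
regions = {"low-carbon region": 0.03, "high-carbon region": 0.39}

results = {name: emissions_tonnes(ENERGY_MWH, g) for name, g in regions.items()}
ratio = results["high-carbon region"] / results["low-carbon region"]

for name, tonnes in results.items():
    print(f"{name}: {tonnes:,.0f} t CO2e")
print(f"spread: {ratio:.0f}x for the same energy")
```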
These numbers are educated guesswork. They rest on unverified GPU counts and assumed values for power usage effectiveness, hardware utilization, and grid intensity. Treat one-significant-figure precision as the ceiling.
Inference Is Where the Energy Lives
A model is trained once and queried billions of times. Over a deployed model’s life, inference dominates total energy. This is the part that compounds, and the part with no standard measurement.
In August 2025 Google published the first detailed per-prompt disclosure from a major provider: the median Gemini text prompt uses 0.24 watt-hours, emits 0.03 gCO2e, and consumes 0.26 mL of water. That is roughly the energy of running a microwave for one second. Google also reported that the median prompt had used 33x more energy a year earlier, in May 2024; the drop is a real efficiency gain from better serving and model optimization.
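The unit conversions behind those comparisons, as a quick check. The 1,000 W microwave rating is an assumption; the other inputs are the reported figures.

```python
WH_PER_PROMPT = 0.24     # median Gemini text prompt, as reported
MICROWAVE_WATTS = 1000   # assumption: a typical household microwave

joules = WH_PER_PROMPT * 3600                 # 1 Wh = 3600 J
microwave_seconds = joules / MICROWAVE_WATTS  # seconds of microwave time
wh_year_earlier = WH_PER_PROMPT * 33          # implied May 2024 median

print(f"{joules:.0f} J per prompt")
print(f"{microwave_seconds:.2f} s of microwave time")
print(f"{wh_year_earlier:.2f} Wh per prompt a year earlier")
```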
Two caveats make that 0.24 Wh number less clean than it looks. First, the scope excludes model training, end-user device energy, network transport, and data storage. It counts TPU, host CPU and DRAM, idle provisioning, and data center overhead. Second, the emissions figure uses a market-based estimate that credits Google’s clean-energy purchases rather than the actual grid mix where the compute ran. A location-based number would be higher.
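Dividing the reported figures by each other shows what the disclosure implies. This is a rough sketch: the inputs are rounded, so the implied intensities are approximate, and the 400 g/kWh grid value is a hypothetical stand-in for a location-based mix, not a number from Google's report.

```python
WH_PER_PROMPT = 0.24   # reported energy per median prompt
G_CO2E = 0.03          # reported market-based emissions per prompt
ML_WATER = 0.26        # reported water per prompt

# g/Wh -> g/kWh needs a factor of 1000; mL/Wh equals L/kWh with no conversion
implied_g_per_kwh = G_CO2E / WH_PER_PROMPT * 1000
implied_l_per_kwh = ML_WATER / WH_PER_PROMPT


def location_based_g(wh, grid_g_per_kwh):
    """Emissions per prompt if charged at the local grid's actual intensity."""
    return wh / 1000 * grid_g_per_kwh


HYPOTHETICAL_GRID = 400  # g CO2e per kWh, illustrative only
location_g = location_based_g(WH_PER_PROMPT, HYPOTHETICAL_GRID)

print(f"implied market-based intensity: {implied_g_per_kwh:.0f} g/kWh")
print(f"implied water intensity: {implied_l_per_kwh:.2f} L/kWh")
print(f"location-based at {HYPOTHETICAL_GRID} g/kWh: {location_g:.3f} g per prompt")
```

Any grid dirtier than the implied market-based intensity pushes the per-prompt emissions above 0.03 g, which is the sense in which a location-based number would be higher.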
Why Per-Query Numbers Span Two Orders of Magnitude
“How much energy does a prompt use” has no single answer because three things move the number hard:
- Model size. Per-token energy scales super-linearly with parameters. Going from a 7B to a 70B model raises per-token energy by roughly 100x, not 10x.
- Output length and reasoning. Energy is paid per generated token. A reasoning model that emits thousands of internal tokens before answering costs far more than a one-line reply. DeepSeek R1 has been measured at 0.96 to 3.74 Wh per query at interactive settings, roughly 4x to 16x a median Gemini prompt.
- Prompt length. Self-attention is quadratic in input length, so a long context window is disproportionately expensive to process.
So the honest range for a single query is roughly 0.2 Wh to several Wh, depending on which model and what you asked it. A blanket “ChatGPT uses X per query” claim is almost always quoting one configuration as if it were universal.
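Two of those factors are easy to check numerically. The sketch below computes the DeepSeek R1 to Gemini ratio from the figures above, then counts the query-key pairs that causal self-attention scores for a prompt, as a proxy for the quadratic term. The prompt lengths are arbitrary examples, and attention is only one part of inference compute, so total energy grows less steeply than this count.

```python
GEMINI_WH = 0.24
R1_RANGE_WH = (0.96, 3.74)

# How many median Gemini prompts one DeepSeek R1 query is worth
ratios = tuple(wh / GEMINI_WH for wh in R1_RANGE_WH)


def causal_attention_pairs(n_tokens):
    """Number of (query, key) pairs scored when each token attends to itself
    and every earlier token in an n-token prompt."""
    return n_tokens * (n_tokens + 1) // 2


short_prompt, long_prompt = 1_000, 10_000  # arbitrary example lengths
growth = causal_attention_pairs(long_prompt) / causal_attention_pairs(short_prompt)

print(f"R1 vs Gemini: {ratios[0]:.1f}x to {ratios[1]:.1f}x")
print(f"10x longer prompt -> {growth:.0f}x more attention pairs")
```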
The Disclosure Gap
Google’s report is notable mostly because it is nearly alone. Most providers publish nothing usable, and the few public numbers use incompatible boundaries (some include training amortization, some don’t; some use market-based carbon, some location-based). Without standardized scopes the figures aren’t comparable, which is convenient for anyone wanting to quote the lowest one. This is the same dynamic as Openwashing: a favorable metric, selectively scoped, presented as the whole picture.
Per-query efficiency is also not the same as total impact. Even as each Gemini prompt got 33x cheaper, Google’s overall carbon footprint has risen 48% since 2019, driven by AI buildout. Cheaper per query plus vastly more queries is the Jevons Paradox in action, and it is why LLM energy use feeds directly into Data Center Externalities.
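The mechanism fits in a few lines. The query volumes below are hypothetical, chosen only to show the shape of the effect; they are not Google's traffic figures, and a company-wide footprint covers far more than text prompts.

```python
def total_energy_mwh(wh_per_query, queries):
    """Total energy in MWh for a given per-query cost and query volume."""
    return wh_per_query * queries / 1e6


WH_NOW = 0.24
WH_BEFORE = WH_NOW * 33  # per-query cost before the reported 33x improvement

# Hypothetical volumes: usage grows 50x while per-query cost falls 33x
before = total_energy_mwh(WH_BEFORE, 1e9)
after = total_energy_mwh(WH_NOW, 50e9)

print(f"before: {before:,.0f} MWh")
print(f"after:  {after:,.0f} MWh")
print(f"total changes by {after / before:.2f}x")
```

Total energy rises whenever volume grows by more than the efficiency gain, here any growth above 33x.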
Try it
Measure your own inference energy (1-2 hours, Python + a local model). Run an open-weight model with Ollama and log GPU power with nvidia-smi --query-gpu=power.draw --format=csv -lms 200 (or powermetrics on Apple Silicon) while you send prompts. Compare a 7B vs a 70B model, and a short reply vs a long reasoning trace. You should see power draw hold roughly flat during generation while total energy tracks output length, and the bigger model should use far more energy per token, replicating the super-linear scaling described above.
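One way to turn the power log into energy numbers, as a minimal sketch. It assumes a single GPU, the nvidia-smi output redirected to a file, and evenly spaced samples at the 200 ms interval requested above (real intervals drift a little). The function names and the sample readings are made up for illustration. It measures GPU power only, so CPU, DRAM, and idle baseline are not separated out.

```python
def parse_power_log(lines):
    """Extract watt readings from nvidia-smi CSV lines such as '45.32 W'."""
    watts = []
    for line in lines:
        field = line.strip().rstrip("W").strip()
        try:
            watts.append(float(field))
        except ValueError:
            continue  # header row ("power.draw [W]"), blank lines, "[N/A]"
    return watts


def energy_wh(watts, interval_s=0.2):
    """Integrate evenly spaced power samples into watt-hours (trapezoid rule)."""
    if len(watts) < 2:
        return 0.0
    joules = sum((a + b) / 2 * interval_s for a, b in zip(watts, watts[1:]))
    return joules / 3600


# Made-up sample log; in practice: lines = open("power.csv").read().splitlines()
sample_log = ["power.draw [W]", "100.00 W", "200.00 W", "200.00 W", "100.00 W"]
readings = parse_power_log(sample_log)
total_wh = energy_wh(readings)

tokens_generated = 50  # hypothetical; take this from your model's response stats
print(f"{total_wh:.4f} Wh total, {total_wh / tokens_generated * 1000:.3f} mWh per token")
```

Record a log per run, divide by the number of generated tokens, and compare the per-token figure across model sizes and reply lengths.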
See also
- The Cost Subsidization of LLM Use — energy is a large slice of the real inference cost being subsidized
Sources
- The carbon footprint of GPT-4 — bottom-up training estimate and the region-dependent carbon spread
- Google: Measuring the environmental impact of AI inference — the per-prompt methodology and its scope boundaries
- MIT Technology Review: Google’s per-prompt energy data — context and critique of the 0.24 Wh figure