Jon Moshier / Notes / LLM Energy Use draft
Note · From the Notebook

LLM Energy Use

What it actually costs in energy to train and run large language models, why per-query numbers vary by orders of magnitude, and why inference now dominates the footprint.

The energy a language model consumes splits into two phases with very different shapes: a one-time training burn, and a per-query inference cost paid forever. The headline numbers people cite are usually training figures, but inference is where the lifetime energy goes, and it is the part almost no one measures honestly.

Training Is a One-Time Spike

Training a frontier model is a discrete, enormous event. GPT-4 was, by leaked estimates, trained on roughly 25,000 Nvidia A100 GPUs for 90 to 100 days, around 57 million GPU-hours. That works out to somewhere between 51,772 and 62,318 MWh of electricity and an estimated 15,000+ tons of CO2e, comparable to the annual emissions of about 938 average Americans.
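The arithmetic behind an estimate like this is short enough to check by hand. The sketch below is an illustration, not the original estimators' method: the per-GPU power share (a 6.5 kW eight-GPU server split evenly) and the PUE of 1.18 are assumptions introduced here, and only the GPU count and duration come from the figures above.

```python
GPUS = 25_000            # leaked estimate quoted in the text
KW_PER_GPU = 6.5 / 8     # ASSUMPTION: 6.5 kW eight-GPU server, split evenly
PUE = 1.18               # ASSUMPTION: data center power usage effectiveness


def gpu_hours(gpus, days):
    """Total GPU-hours for a run of `days` on `gpus` accelerators."""
    return gpus * days * 24


def training_mwh(gpus, days, kw_per_gpu=KW_PER_GPU, pue=PUE):
    """Facility-level electricity in MWh: GPU-hours x kW per GPU x PUE."""
    return gpu_hours(gpus, days) * kw_per_gpu * pue / 1000


for days in (90, 100):
    hours = gpu_hours(GPUS, days)
    print(f"{days} days: {hours / 1e6:.0f}M GPU-hours, "
          f"{training_mwh(GPUS, days):,.0f} MWh")
```

The 90 to 100 day range gives 54 to 60 million GPU-hours, which brackets the 57 million quoted. Under these assumed values the energy comes out at roughly 51,800 to 57,500 MWh, close to the low end of the published range. The high end depends on assumptions not stated in this note, which is the point: change the power share or PUE slightly and the total moves by thousands of MWh.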

The carbon figure is far softer than the energy figure. The same training run can vary by a factor of 13 in emissions depending on which Azure region it runs in, because grid carbon intensity differs that much between locations. Energy is physics. Carbon is an accounting choice about where and when you draw power.
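To make the energy-versus-carbon distinction concrete, here is a minimal sketch that holds the energy fixed and varies only the grid. The two intensity values are hypothetical, chosen only to be 13x apart; they are not measured figures for any Azure region.

```python
ENERGY_MWH = 51_772  # low-end training estimate quoted in the text


def emissions_tonnes(energy_mwh, kg_co2e_per_kwh):
    """Tonnes CO2e for a given energy and grid intensity.

    MWh x 1000 kWh/MWh x kg/kWh / 1000 kg/tonne simplifies to MWh x kg/kWh.
    """
    return energy_mwh * kg_co2e_per_kwh


# HYPOTHETICAL grid intensities in kg CO2e per kWh, 13x apart by construction.
grids = {"low-carbon grid": 0.03, "high-carbon grid": 0.39}

for name, intensity in grids.items():
    print(f"{name}: {emissions_tonnes(ENERGY_MWH, intensity):,.0f} t CO2e")
```

The energy term never changes between the two lines of output. Only the multiplier does, which is why a carbon figure quoted without its grid assumption says little.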

These numbers are educated guesswork. They rest on unverified GPU counts and assumed values for power usage effectiveness, hardware utilization, and grid intensity. Treat one-significant-figure precision as the ceiling.

Inference Is Where the Energy Lives

A model is trained once and queried billions of times. Over a deployed model’s life, inference dominates total energy. This is the part that compounds, and the part with no standard measurement.

In August 2025 Google published the first detailed per-prompt disclosure from a major provider: the median Gemini text prompt uses 0.24 watt-hours, emits 0.03 gCO2e, and consumes 0.26 mL of water. That is roughly the energy of running a microwave for one second. Google also reported that the median prompt used 33x more energy a year earlier, in May 2024, a real efficiency gain from better serving and model optimization.

Two caveats make that 0.24 Wh number less clean than it looks. First, the scope excludes model training, end-user device energy, network transport, and data storage. It counts TPU, host CPU and DRAM, idle provisioning, and data center overhead. Second, the emissions figure uses a market-based estimate that credits Google’s clean-energy purchases rather than the actual grid mix where the compute ran. A location-based number would be higher.
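A few lines of arithmetic show what the disclosed figures imply. This is a sketch using only the numbers quoted above, plus one labeled assumption: the location-based grid intensity of 400 gCO2e/kWh is a hypothetical placeholder, not a figure from Google's report.

```python
WH_PER_PROMPT = 0.24       # median Gemini text prompt, from the text
G_CO2E_PER_PROMPT = 0.03   # market-based emissions, from the text
EFFICIENCY_GAIN = 33       # year-over-year improvement, from the text

joules_per_prompt = WH_PER_PROMPT * 3600               # 1 Wh = 3600 J
wh_year_earlier = WH_PER_PROMPT * EFFICIENCY_GAIN      # implied May 2024 figure
implied_g_per_kwh = G_CO2E_PER_PROMPT / (WH_PER_PROMPT / 1000)


def location_based_g(wh, grid_g_per_kwh):
    """Emissions in grams if the energy is charged at the local grid mix."""
    return wh / 1000 * grid_g_per_kwh


HYPOTHETICAL_GRID_G_PER_KWH = 400  # ASSUMPTION, for illustration only

print(f"{joules_per_prompt:.0f} J per prompt")
print(f"{wh_year_earlier:.2f} Wh per prompt a year earlier")
print(f"{implied_g_per_kwh:.0f} gCO2e/kWh implied by the market-based figure")
print(f"{location_based_g(WH_PER_PROMPT, HYPOTHETICAL_GRID_G_PER_KWH):.3f} "
      "gCO2e per prompt on the hypothetical grid")
```

The market-based figure implies an intensity of about 125 gCO2e/kWh. Any grid dirtier than that pushes the location-based number above 0.03 g, which is the second caveat in numeric form.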

Why Per-Query Numbers Span Two Orders of Magnitude

“How much energy does a prompt use” has no single answer because three things move the number hard:

So the honest range for a single query is roughly 0.2 Wh to several Wh, depending on which model and what you asked it. A blanket “ChatGPT uses X per query” claim is almost always quoting one configuration as if it were universal.

The Disclosure Gap

Google’s report is notable mostly because it is nearly alone. Most providers publish nothing usable, and the few public numbers use incompatible boundaries (some include training amortization, some don’t; some use market-based carbon, some location-based). Without standardized scopes the figures aren’t comparable, which is convenient for anyone wanting to quote the lowest one. This is the same dynamic as Openwashing: a favorable metric, selectively scoped, presented as the whole picture.

Per-query efficiency is also not the same as total impact. Even as each Gemini prompt got 33x cheaper, Google’s overall carbon footprint has risen 48% since 2019, driven by AI buildout. Cheaper per query plus vastly more queries is the Jevons Paradox in action, and it is why LLM energy use feeds directly into Data Center Externalities.
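The relationship between per-query efficiency and total energy is a single ratio. In the sketch below the query-volume multipliers are hypothetical, since the note gives no volume figures. Only the 33x efficiency gain comes from the text.

```python
def relative_total(efficiency_gain, volume_growth):
    """Total energy relative to baseline when per-query cost falls by
    `efficiency_gain` and query volume rises by `volume_growth`."""
    return volume_growth / efficiency_gain


EFFICIENCY_GAIN = 33  # from the text

# HYPOTHETICAL query-volume multipliers, for illustration only.
for growth in (10, 33, 100):
    total = relative_total(EFFICIENCY_GAIN, growth)
    print(f"{growth}x queries -> {total:.2f}x total energy")
```

Break-even sits at exactly 33x more queries. Any volume growth beyond that and the total rises despite the efficiency gain.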

Try it

Measure your own inference energy (1-2 hours, Python + a local model). Run an open-weight model with Ollama and log GPU power with nvidia-smi --query-gpu=power.draw --format=csv -lms 200 (or powermetrics on Apple Silicon) while you send prompts. Compare a 7B model against a 70B model, and a short reply against a long reasoning trace. You should see power draw hold roughly flat while the model is generating, total energy track output length, and the bigger model draw far more energy per token, replicating the super-linear scaling described above.
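One way to wire this up is sketched below, under several assumptions: a single NVIDIA GPU, Ollama serving on its default local port, and a non-streaming call to its /api/generate endpoint, whose response reports the generated token count as eval_count. The function names are illustrative. It measures GPU board power only, so it undercounts relative to a scope that includes host CPU, DRAM, and facility overhead.

```python
import json
import subprocess
import threading
import time
import urllib.request


def energy_wh(samples):
    """Trapezoidal integral of (seconds, watts) samples, in watt-hours."""
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules / 3600.0


def sample_gpu_power(samples, stop, interval_ms=200):
    """Append (timestamp, watts) readings from nvidia-smi until `stop` is set.

    Assumes one GPU: with several, nvidia-smi prints one line per GPU per tick.
    """
    cmd = ["nvidia-smi", "--query-gpu=power.draw",
           "--format=csv,noheader,nounits", "-lms", str(interval_ms)]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    try:
        for line in proc.stdout:
            if stop.is_set():
                break
            try:
                samples.append((time.monotonic(), float(line)))
            except ValueError:
                continue  # skip non-numeric readings such as "[N/A]"
    finally:
        proc.terminate()


def measure(model, prompt, url="http://localhost:11434/api/generate"):
    """Send one prompt to Ollama and return GPU energy for that request."""
    samples, stop = [], threading.Event()
    sampler = threading.Thread(target=sample_gpu_power, args=(samples, stop))
    sampler.start()
    time.sleep(1.0)  # give the sampler time to start before the request
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(url, data=body.encode(),
                                 headers={"Content-Type": "application/json"})
    try:
        t_start = time.monotonic()
        with urllib.request.urlopen(req) as resp:
            reply = json.load(resp)
        t_end = time.monotonic()
    finally:
        stop.set()
        sampler.join()
    # Keep only the samples taken while the request was in flight.
    window = [(t, w) for t, w in samples if t_start <= t <= t_end]
    tokens = reply.get("eval_count", 0)
    wh = energy_wh(window)
    return {"wh": wh, "tokens": tokens,
            "wh_per_token": wh / tokens if tokens else None}
```

Calling measure("your-model-tag", "Explain PUE in one sentence.") for each model and prompt length gives comparable rows (the model tag is a placeholder for whatever you have pulled). The result includes idle draw during the request. To isolate the marginal cost, record a few seconds of idle power first and subtract it.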
