Jon Moshier / Notes / Anthropic Prompt Caching draft
Note · From the Notebook

Anthropic Prompt Caching

Notes from implementing prompt caching in [[Pantheon]] to dodge Sonnet's 30K input-TPM rate limit

Anthropic Prompt Caching

Notes from implementing prompt caching in Pantheon to dodge Sonnet’s 30K input-TPM rate limit.

The mental model

Every API request to Claude sends the entire conversation again: system prompt + tool definitions + every prior message. The model re-reads all of it on every turn. That’s why a 10-turn chat costs roughly 10× more input tokens than a 1-turn chat — you’re paying for the same prefix N times.

Prompt caching changes the deal. You mark a prefix of the request as cacheable. The first time Anthropic sees it they store the model’s internal state at that prefix (the KV cache, essentially). On the next request that shares the exact same prefix, they restart from the stored state instead of re-reading the tokens.

Two benefits:

  1. Cost. Cache reads are billed at ~10% of normal input. Cache writes are ~25% more than normal input (one-time), so anything reused twice pays for itself.
  2. Rate limits. On the Console plan, cache reads don’t count against the input TPM limit — for example, Sonnet’s 30K input tokens/minute. This is the lever that matters when you’re getting 429’d, not when you’re worried about dollars.

How the prefix is structured

A request to Claude has a strict order:

system  →  tools  →  messages

You insert cache breakpoints into the request. A breakpoint says “the cache prefix ends here.” Up to 4 breakpoints per request. When Anthropic gets a new request, they look at it left-to-right and find the longest cached prefix that matches — they restart from there and only process the suffix.

Example layout:

[system block — cache_control: ephemeral]  ← breakpoint #1
[tool 1]
[tool 2]
...
[tool N — cache_control: ephemeral]         ← breakpoint #2
[user msg 1]
[assistant msg 1]
[user msg 2 — cache_control: ephemeral]     ← breakpoint #3 (optional)
[assistant msg 2]   ← only this is "new" on the next request

Where you put the breakpoint determines how much gets cached. Breakpoint at the system block alone = system cached. Breakpoint on the last tool definition = system + tools cached. Breakpoint on the last stable message = system + tools + history cached.

Constraints

The rolling history trick

In a long chat, you can keep extending the cached prefix forward:

Each request pays the cache-write premium on only the small new suffix, and reads everything before it cheaply. This is what makes long agentic sessions cost-feasible. In Pantheon we’re starting with just system + tools cached — adding message-level breakpoints is a follow-up if needed.

Verifying it’s working

The streaming response includes a usage field with:

First request: high cache_creation, zero cache_read. Subsequent requests within 5 min: zero cache_creation, high cache_read. If cache_read stays at zero across requests, something is invalidating the prefix — usually a non-deterministic system-prompt component.

What this fixes in Pantheon

Symptom: Sonnet 4.6 returns 429 (rate limit) after a handful of turns despite the 1M-token context window. The wall isn’t context size — it’s the 30K input-TPM budget. By caching the system prompt (which on this project pulls in CLAUDE.md, README.md, .pantheon/context.md → easily several thousand tokens) and tool definitions, every subsequent request within 5 minutes pays effectively zero TPM for that prefix.

See also: multi-llm-coordinator-idea, ~/Projects/pantheon/docs/context-window-plan.md (the full plan for context-window fixes, item #8).

References

← All notes Read recent essays →