Anthropic Prompt Caching

Notes from implementing prompt caching in Pantheon to dodge Sonnet’s 30K input-TPM rate limit.

The mental model

Every API request to Claude sends the entire conversation again: system prompt + tool definitions + every prior message. The model re-reads all of it on every turn. That’s why a 10-turn chat costs roughly 10× more input tokens than a 1-turn chat — you’re paying for the same prefix N times.

Prompt caching changes the deal. You mark a prefix of the request as cacheable. The first time Anthropic sees it they store the model’s internal state at that prefix (the KV cache, essentially). On the next request that shares the exact same prefix, they restart from the stored state instead of re-reading the tokens.

Two benefits:

Cost. Cache reads are billed at ~10% of normal input. Cache writes are ~25% more than normal input (one-time), so anything reused twice pays for itself.
Rate limits. On the Console plan, cache reads don’t count against the input TPM limit — for example, Sonnet’s 30K input tokens/minute. This is the lever that matters when you’re getting 429’d, not when you’re worried about dollars.

How the prefix is structured

A request to Claude has a strict order:

system  →  tools  →  messages

You insert cache breakpoints into the request. A breakpoint says “the cache prefix ends here.” Up to 4 breakpoints per request. When Anthropic gets a new request, they look at it left-to-right and find the longest cached prefix that matches — they restart from there and only process the suffix.

Example layout:

[system block — cache_control: ephemeral]  ← breakpoint #1
[tool 1]
[tool 2]
...
[tool N — cache_control: ephemeral]         ← breakpoint #2
[user msg 1]
[assistant msg 1]
[user msg 2 — cache_control: ephemeral]     ← breakpoint #3 (optional)
[assistant msg 2]   ← only this is "new" on the next request

Where you put the breakpoint determines how much gets cached. Breakpoint at the system block alone = system cached. Breakpoint on the last tool definition = system + tools cached. Breakpoint on the last stable message = system + tools + history cached.

Constraints

Minimum prefix size. Sonnet/Opus: 1024 tokens. Haiku: 2048 tokens. Smaller than that and the breakpoint is ignored. So a tiny system prompt won’t cache.
TTL. ephemeral = 5 minutes from last hit. Idle longer than that, the cache is evicted and the next request is a full write.
Exact prefix match. A single character change anywhere before the breakpoint invalidates the cache. Including timestamps, dates, or user identifiers early in the system prompt breaks caching every request.
Up to 4 breakpoints. They’re tried longest-prefix-first, so older breakpoints stay live even after you add new ones — useful for the rolling-history case below.

The rolling history trick

In a long chat, you can keep extending the cached prefix forward:

Turn 1: cache system + tools. New = user msg 1 + assistant reply.
Turn 2: move the message-level breakpoint to after assistant reply 1. Now cache covers system + tools + msg 1 + reply 1. New = user msg 2 + reply.
Turn 3: move again. Cache grows.

Each request pays the cache-write premium on only the small new suffix, and reads everything before it cheaply. This is what makes long agentic sessions cost-feasible. In Pantheon we’re starting with just system + tools cached — adding message-level breakpoints is a follow-up if needed.

Verifying it’s working

The streaming response includes a usage field with:

cache_creation_input_tokens — wrote new entries this request
cache_read_input_tokens — restored from cache (these are the ones that don’t count toward TPM)
input_tokens — actually-new tokens

First request: high cache_creation, zero cache_read. Subsequent requests within 5 min: zero cache_creation, high cache_read. If cache_read stays at zero across requests, something is invalidating the prefix — usually a non-deterministic system-prompt component.

What this fixes in Pantheon

Symptom: Sonnet 4.6 returns 429 (rate limit) after a handful of turns despite the 1M-token context window. The wall isn’t context size — it’s the 30K input-TPM budget. By caching the system prompt (which on this project pulls in CLAUDE.md, README.md, .pantheon/context.md → easily several thousand tokens) and tool definitions, every subsequent request within 5 minutes pays effectively zero TPM for that prefix.

See also: multi-llm-coordinator-idea, ~/Projects/pantheon/docs/context-window-plan.md (the full plan for context-window fixes, item #8).

References

Anthropic docs: Prompt caching
Rate limits page — cache-read exclusion