Prompt Caching

Prompt caching lets LLMs reuse computation for repeated prefixes. New requests sharing a prefix with previously-cached requests skip most of the computation, dramatically reducing cost and latency.

For applications with long static prompts (system prompts, RAG, few-shot examples), prompt caching can cut cost 90%+.

Why caching helps

LLMs process prompts as a sequence. Each token's computation depends on all previous tokens (causal attention).

For a prompt of N tokens, the model performs O(N²) work to set up generation.

If two prompts share the first M tokens, the model could reuse the computation for those M tokens — if it has the cached state.

What providers offer

Anthropic (Claude)

Opt-in via `cache_control` markers. Cached prefixes:

- 90% cheaper to read

- 25% more expensive to write

- 5-minute TTL by default; 1-hour available

Mark up to 4 cache breakpoints in your prompt.

OpenAI

Automatic for prompts ≥1024 tokens. No code changes needed.

- 50% discount on cached tokens

- ~5-10 minute TTL

Google (Gemini)

Explicit context caching. Create cache, reuse it.

- Pay storage per hour

- Pay reduced rate for cache reads

Open source / self-hosted

vLLM and TGI support prefix caching automatically.

- Memory-bound: cache fits while serving fits memory

- LRU eviction

Self-hosting gives the most control.

What to cache

System prompts

Long system prompts repeat across requests. Major win.

If your system prompt is 5K tokens and you handle 1M requests/day, that's 5B input tokens/day before user content. Caching collapses this to ~1 cache write + 1M cache reads.

Few-shot examples

Examples in the prompt repeat. Cache them.

RAG context

If chunks recur across queries, caching helps.

For typical RAG: each query retrieves different chunks. Caching is more useful for:

- Conversation history

- Recently retrieved chunks

- Document-specific contexts

Long documents (multiple queries)

Asking many questions about the same document? Cache the document.

Tool definitions

Tool schemas in agent prompts can be lengthy. Cache them.

Conversation history

For multi-turn: cache history, append new turn.

What NOT to cache

Highly variable content

If every request has unique content at the start, caching doesn't help.

Short prompts

Below provider thresholds, caching doesn't apply (or doesn't pay back the write cost).

Privacy-sensitive content

Caching may share infrastructure across requests. Verify provider isolation guarantees.

Structuring prompts for caching

The key insight: caching works on prefixes. Put stable content first, variable content last.

Bad order

```

[user query]

[system prompt]

[examples]

```

System prompt repeats but isn't a prefix. No caching benefit.

Good order

```

[system prompt]

[examples]

[tool definitions]

[user query]

```

Stable prefix; variable suffix. Cache hits on every request.

Multiple cache levels

Some providers allow multiple cache breakpoints:

```

[system prompt] -- cache point 1 (rarely changes)

[examples] -- cache point 2 (occasionally changes)

[tool definitions] -- cache point 3 (changes per app)

[conversation history] -- cache point 4 (grows per conversation)

[user query] -- not cached (always new)

```

Each cache point can be reused independently when content changes downstream.

Anthropic-specific patterns

```python

messages = [

{

"role": "user",

"content": [

{

"type": "text",

"text": SYSTEM_PROMPT,

"cache_control": {"type": "ephemeral"}

},

{

"type": "text",

"text": user_query

}

]

}

]

```

Cache markers tell Anthropic what to cache. The system measures the prefix up to each marker.

OpenAI-specific patterns

Automatic. Just reuse the same prefix consistently.

Hit rate is typically:

- 100% for identical prompts within TTL

- High for shared prefixes

- Low for highly varied content

Self-hosted (vLLM)

Prefix caching is enabled by default.

Considerations:

- KV cache memory pressure

- Eviction policies

- Cache hit rate visible in metrics

For high-traffic deployments, prefix caching dramatically improves throughput.

Measuring cache effectiveness

Track:

- **Cache hit rate**: % of requests using cached content

- **Cached tokens / total tokens**: portion of work saved

- **Cost savings**: actual billing reduction

- **Latency improvement**: caches make responses faster too

If hit rate is low, prompt structure may need work.

Cost math

Without caching:

- 5K system prompt × $3/M tokens × 1M requests = $15K/day

With Anthropic caching (90% read discount, 25% write premium):

- Cache write: 5K × $3.75/M × 1 = ~$0

- Cache reads: 5K × $0.30/M × 1M = $1.5K/day

- ~90% savings

Common failure patterns

Variable prefix

System prompt has dynamic content (date, user ID) at the start. No cache hits.

Fix: move dynamic content to suffix.

Frequent invalidation

System prompt changes constantly during development. Cache write cost dominates.

Stabilize prompts before relying on caching.

Inconsistent formatting

Whitespace differences invalidate cache. Be exactly consistent.

TTL surprises

Cache expires at 5 min idle. Bursty traffic patterns may not benefit.

Memory pressure (self-hosted)

KV cache competes with batch capacity. May need to tune.

Cache coherence

Multiple deployment instances may have different caches. Each instance warms separately.

Practical workflow

1. Identify stable prefix in your prompts (system, examples, tools)

2. Structure prompts: stable first, variable last

3. Add cache markers (Anthropic) or rely on auto (OpenAI)

4. Monitor hit rate

5. Iterate on prompt structure if hit rate is low

Edge cases

Prompt updates

When system prompt changes, all caches invalidate.

For high-volume apps, batch prompt updates rather than rolling out gradually.

Rate limiting

Cached requests use less compute but still count against rate limits.

Multi-tenant

Per-customer caching may be needed. Some providers support isolation; verify.

Streaming

Caching applies to prefix processing. Streaming the response is independent.

Where this is going

- Longer cache TTLs

- More automatic caching

- Better cache management for self-hosted

- Cross-request optimization beyond simple prefix caching

The cost dynamics of LLM inference favor caching. Expect more sophistication.

Further Reading

- [PromptCachingStrategies](PromptCachingStrategies) — Strategy details

- [AgentPromptEngineering](AgentPromptEngineering) — Agent patterns

- [OpenSourceLlmEcosystem](OpenSourceLlmEcosystem) — Self-hosted options

- [Generative AI Hub](GenerativeAIHub) — Cluster index