Prompt Caching Strategies
Knowing that prompt caching exists is the easy part. Designing prompts to maximize cache hits across realistic workloads takes practice.
This page goes deeper than the mechanics into the strategy.
The core insight
Caching is prefix-based. Maximize cache hits by:
1. Maximizing the length of the stable prefix
2. Minimizing variation in that prefix
3. Pushing variation to the end
This single principle drives most of the strategy.
Strategy 1: Layer your prompt
Think in layers from most stable to most variable:
| Layer | Stability | Update frequency |
|-------|-----------|------------------|
| Constants | very stable | rarely |
| System prompt | stable | days/weeks |
| Examples | semi-stable | weeks/months |
| Tool definitions | semi-stable | per-app |
| Document context | per-conversation | per-conversation |
| Conversation history | grows | per-turn |
| Current query | unique | per-request |
Each layer can have its own cache breakpoint (where supported).
Strategy 2: Stabilize the volatile
Some content seems variable but can be stabilized:
Date/time
Don't put exact timestamp in prompt. Use coarse "2026-04" or omit if not essential.
User identifiers
If only used for personalization, move to dedicated section after the cacheable prefix.
Random sampling
Examples chosen at random invalidate the cache. Use stable example sets.
Live data
Push to suffix where possible.
Strategy 3: Static-then-dynamic structure
Always organize as:
```
[stable prefix - long, cacheable]
---
[volatile suffix - short, unique]
```
This works regardless of provider.
Strategy 4: Multi-tier caching
For applications with multiple stable layers, use multiple cache breakpoints.
Anthropic supports up to 4 cache control markers.
```
[system: 5K tokens] --- cache 1
[examples: 3K tokens] --- cache 2
[tools: 2K tokens] --- cache 3
[user input: variable] (not cached)
```
When examples change, cache 1 still hits.
When system changes, all caches invalidate.
Order from most-stable to least.
Strategy 5: Document-anchored conversations
For chatting about a document:
```
[system prompt]
[document content] --- cache here
[turn 1]
[turn 2]
...
```
The document acts as a long stable suffix. Each new turn benefits from caching the document.
Strategy 6: Few-shot example management
Few-shot examples are great for quality but bloat prompts.
Stable examples
Use a fixed set of N examples. Cache them.
Dynamic examples
Selecting examples per query (e.g., from a few-shot index) defeats caching.
Compromise: cluster queries; cache examples per cluster.
Example versioning
Avoid reformatting examples without need. Whitespace changes invalidate.
Strategy 7: System prompt versioning
Keep system prompts in version control. Treat changes as deployments.
Bad: editing the system prompt for each conversation.
Good: stable system prompt; per-conversation parameters in conversation.
Strategy 8: Dynamic instructions in suffix
Use cases where instructions vary by user/context:
Bad placement (defeats caching):
```
You are helping {user_name}, who prefers {communication_style}...
[long stable content]
[query]
```
Better:
```
[long stable content]
User context: name={user_name}, style={communication_style}
Query: ...
```
The user-specific bits move to the variable region.
Strategy 9: Conversation history caching
For multi-turn conversations:
```
[system + tools]
[turn 1 user]
[turn 1 assistant]
[turn 2 user]
[turn 2 assistant]
[turn 3 user] <-- new
```
Each turn extends the cached prefix. Most providers cache up to the last assistant message.
For long conversations, cache hits stay high until conversation grows beyond cache TTL.
Strategy 10: Refresh cached prefixes
Caches expire (5 min default). For low-traffic apps, the cache may be cold by next request.
Mitigation:
- Periodic warm-up requests
- Use 1-hour cache where available (Anthropic)
- Co-locate caching with traffic patterns
Anti-patterns
Random ordering
Reordering tool definitions defeats caching. Sort consistently.
Trailing whitespace
Different trailing whitespace = different cache key.
Locale-dependent rendering
Generating prompts via templates that vary subtly by environment.
Per-request tweaks
Adding "Today is X" each request invalidates the prefix.
Hash-based shuffling
Some apps shuffle examples by user_id hash. Defeats caching.
Measuring success
Key metrics:
Cache hit rate
Hit rate by request type. Investigate low-hit categories.
Cached tokens per request
Average count tells you how much work is being saved.
Cost reduction
Compare cached cost to uncached cost. Some apps see 90%+ reduction.
Latency
Cache hits also reduce time to first token.
Common patterns
Customer support
Per-customer system prompt with policies + tools + examples.
Cache: everything except current ticket text. ~95% hits.
Code assistants
System + recent code context.
Cache: system + opened files. New question = new suffix.
RAG with conversation
Document set + history + new query.
Cache: document set (if same per session) + history.
Agent workflows
System + tools + ReAct trace so far.
Cache: extends with each action.
Provider-specific notes
Anthropic
- 4 explicit breakpoints
- 5-min and 1-hour TTLs
- Pay 25% premium on first write
- Manual control = predictable behavior
OpenAI
- Automatic
- ≥1024 token threshold
- ~50% discount on cached tokens
- Less control; usually just works
Self-hosted
- vLLM: automatic prefix caching
- Memory-bound — tune to traffic
- Visible cache stats
Cost-benefit analysis
Cache pays back when:
(uncached_cost - cached_cost) × hits > write_cost × cache_lifecycle
For long prompts and high reuse: ratio is dramatic.
For short prompts: write premium may exceed savings.
Iteration
When cache hit rate is lower than expected:
1. Log full prompts (sanitized)
2. Diff between requests
3. Find variable content in the prefix
4. Move variable content to suffix
5. Re-measure
Hit rate should approach 95%+ for well-structured prompts.
Edge cases
Tool result format changes
Tool results in conversation history affect the prefix from that point on.
Stable formatting helps caching subsequent turns.
Streaming
Caching is on prefix processing. Streaming output independent.
Errors mid-conversation
Failed turns may leave inconsistent state. Decide whether to retry from cached prefix or restart.
Multi-region
Cache is typically per-region. Failover may invalidate.
Where this is going
- Smarter automatic detection of cacheable content
- Cross-request semantic caching (similar but not identical prompts)
- Longer TTLs
- Tighter integration with retrieval
For now: structure prompts deliberately. Big payoff for moderate effort.
Further Reading
- [PromptCaching](PromptCaching) — Mechanics
- [AgentPromptEngineering](AgentPromptEngineering) — Agent patterns
- [Generative AI Hub](GenerativeAIHub) — Cluster index