Prompt Caching Strategies

Knowing that prompt caching exists is the easy part. Designing prompts to maximize cache hits across realistic workloads takes practice.

This page goes deeper than the mechanics into the strategy.

The core insight

Caching is prefix-based. Maximize cache hits by:

1. Maximizing the length of the stable prefix

2. Minimizing variation in that prefix

3. Pushing variation to the end

This single principle drives most of the strategy.

Strategy 1: Layer your prompt

Think in layers from most stable to most variable:

| Layer | Stability | Update frequency |

|-------|-----------|------------------|

| Constants | very stable | rarely |

| System prompt | stable | days/weeks |

| Examples | semi-stable | weeks/months |

| Tool definitions | semi-stable | per-app |

| Document context | per-conversation | per-conversation |

| Conversation history | grows | per-turn |

| Current query | unique | per-request |

Each layer can have its own cache breakpoint (where supported).

Strategy 2: Stabilize the volatile

Some content seems variable but can be stabilized:

Date/time

Don't put exact timestamp in prompt. Use coarse "2026-04" or omit if not essential.

User identifiers

If only used for personalization, move to dedicated section after the cacheable prefix.

Random sampling

Examples chosen at random invalidate the cache. Use stable example sets.

Live data

Push to suffix where possible.

Strategy 3: Static-then-dynamic structure

Always organize as:

```

[stable prefix - long, cacheable]

---

[volatile suffix - short, unique]

```

This works regardless of provider.

Strategy 4: Multi-tier caching

For applications with multiple stable layers, use multiple cache breakpoints.

Anthropic supports up to 4 cache control markers.

```

[system: 5K tokens] --- cache 1

[examples: 3K tokens] --- cache 2

[tools: 2K tokens] --- cache 3

[user input: variable] (not cached)

```

When examples change, cache 1 still hits.

When system changes, all caches invalidate.

Order from most-stable to least.

Strategy 5: Document-anchored conversations

For chatting about a document:

```

[system prompt]

[document content] --- cache here

[turn 1]

[turn 2]

...

```

The document acts as a long stable suffix. Each new turn benefits from caching the document.

Strategy 6: Few-shot example management

Few-shot examples are great for quality but bloat prompts.

Stable examples

Use a fixed set of N examples. Cache them.

Dynamic examples

Selecting examples per query (e.g., from a few-shot index) defeats caching.

Compromise: cluster queries; cache examples per cluster.

Example versioning

Avoid reformatting examples without need. Whitespace changes invalidate.

Strategy 7: System prompt versioning

Keep system prompts in version control. Treat changes as deployments.

Bad: editing the system prompt for each conversation.

Good: stable system prompt; per-conversation parameters in conversation.

Strategy 8: Dynamic instructions in suffix

Use cases where instructions vary by user/context:

Bad placement (defeats caching):

```

You are helping {user_name}, who prefers {communication_style}...

[long stable content]

[query]

```

Better:

```

[long stable content]

User context: name={user_name}, style={communication_style}

Query: ...

```

The user-specific bits move to the variable region.

Strategy 9: Conversation history caching

For multi-turn conversations:

```

[system + tools]

[turn 1 user]

[turn 1 assistant]

[turn 2 user]

[turn 2 assistant]

[turn 3 user] <-- new

```

Each turn extends the cached prefix. Most providers cache up to the last assistant message.

For long conversations, cache hits stay high until conversation grows beyond cache TTL.

Strategy 10: Refresh cached prefixes

Caches expire (5 min default). For low-traffic apps, the cache may be cold by next request.

Mitigation:

- Periodic warm-up requests

- Use 1-hour cache where available (Anthropic)

- Co-locate caching with traffic patterns

Anti-patterns

Random ordering

Reordering tool definitions defeats caching. Sort consistently.

Trailing whitespace

Different trailing whitespace = different cache key.

Locale-dependent rendering

Generating prompts via templates that vary subtly by environment.

Per-request tweaks

Adding "Today is X" each request invalidates the prefix.

Hash-based shuffling

Some apps shuffle examples by user_id hash. Defeats caching.

Measuring success

Key metrics:

Cache hit rate

Hit rate by request type. Investigate low-hit categories.

Cached tokens per request

Average count tells you how much work is being saved.

Cost reduction

Compare cached cost to uncached cost. Some apps see 90%+ reduction.

Latency

Cache hits also reduce time to first token.

Common patterns

Customer support

Per-customer system prompt with policies + tools + examples.

Cache: everything except current ticket text. ~95% hits.

Code assistants

System + recent code context.

Cache: system + opened files. New question = new suffix.

RAG with conversation

Document set + history + new query.

Cache: document set (if same per session) + history.

Agent workflows

System + tools + ReAct trace so far.

Cache: extends with each action.

Provider-specific notes

Anthropic

- 4 explicit breakpoints

- 5-min and 1-hour TTLs

- Pay 25% premium on first write

- Manual control = predictable behavior

OpenAI

- Automatic

- ≥1024 token threshold

- ~50% discount on cached tokens

- Less control; usually just works

Self-hosted

- vLLM: automatic prefix caching

- Memory-bound — tune to traffic

- Visible cache stats

Cost-benefit analysis

Cache pays back when:

(uncached_cost - cached_cost) × hits > write_cost × cache_lifecycle

For long prompts and high reuse: ratio is dramatic.

For short prompts: write premium may exceed savings.

Iteration

When cache hit rate is lower than expected:

1. Log full prompts (sanitized)

2. Diff between requests

3. Find variable content in the prefix

4. Move variable content to suffix

5. Re-measure

Hit rate should approach 95%+ for well-structured prompts.

Edge cases

Tool result format changes

Tool results in conversation history affect the prefix from that point on.

Stable formatting helps caching subsequent turns.

Streaming

Caching is on prefix processing. Streaming output independent.

Errors mid-conversation

Failed turns may leave inconsistent state. Decide whether to retry from cached prefix or restart.

Multi-region

Cache is typically per-region. Failover may invalidate.

Where this is going

- Smarter automatic detection of cacheable content

- Cross-request semantic caching (similar but not identical prompts)

- Longer TTLs

- Tighter integration with retrieval

For now: structure prompts deliberately. Big payoff for moderate effort.

Further Reading

- [PromptCaching](PromptCaching) — Mechanics

- [AgentPromptEngineering](AgentPromptEngineering) — Agent patterns

- [Generative AI Hub](GenerativeAIHub) — Cluster index