Cutting LLM Costs with Prompt Caching and Smart Context Management

Token costs compound fast. A retrieval-augmented system that stuffs 50,000 tokens of context into every request will burn through your budget in hours at scale. Prompt caching is one of the most effective levers available to reduce that cost, but most teams leave it on the table because they misunderstand how it works or make small structural mistakes that quietly disable it.

This guide covers the mechanics of prefix-based caching, how to structure prompts to take full advantage of it, and the common failure modes to watch for when using Anthropic’s Claude models.

How Prefix Caching Actually Works

The fundamental idea is simple: if the beginning of your prompt is identical to a previous request, the model can reuse the computed key-value (KV) cache from that earlier call instead of reprocessing those tokens from scratch. Processing tokens is expensive. Reading from cache is cheap.

The critical word is prefix. Caching is not content-aware in a semantic sense. The system does not recognize that two paragraphs mean the same thing. It requires an exact match from the start of the prompt forward, so even a changed space or punctuation mark counts as a difference. The moment any token differs from the cached version, the cache miss begins at that point and every token after it must be recomputed.
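
To make the prefix rule concrete, here is a minimal sketch that measures how much of two prompts would be reusable. It uses whitespace splitting as a stand-in for real tokenization, which is an assumption for illustration only, and shared_prefix_len is a hypothetical helper rather than part of any SDK.

```python
def shared_prefix_len(previous: list[str], current: list[str]) -> int:
    """Return how many leading tokens are identical in both sequences."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # first difference: everything from here on is a miss
        count += 1
    return count


# Whitespace splitting stands in for a real tokenizer (illustrative only).
static = "You are a support assistant . Follow the policy below .".split()

# Stable content first: only the trailing question differs.
good_a = static + "Why is my printer offline ?".split()
good_b = static + "How do I reset my password ?".split()

# Dynamic content first: the second token already differs.
bad_a = "Request 1001".split() + static
bad_b = "Request 1002".split() + static

print(shared_prefix_len(good_a, good_b) == len(static))  # True: whole static block is shared
print(shared_prefix_len(bad_a, bad_b))                   # 1: only "Request" matches
```

In the second pair the static block is byte-for-byte the same in both prompts, yet none of it is reusable because the difference sits in front of it.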

This has a direct architectural implication: stable content must come first, dynamic content must come last. If you place a user’s query at the top of your prompt and your large system instructions below it, you will never get a cache hit on those instructions, because every new query changes the prefix before the instructions are even reached.

Structuring Your Prompts for Maximum Cache Reuse

Think of your prompt as a stack of layers sorted by how frequently they change, from least to most:

Static system instructions (never change between requests)
Retrieved documents or knowledge base chunks (change per topic or session, not per turn)
Conversation history (grows each turn)
Current user message (changes every request)

This ordering maximizes the length of the cacheable prefix. If your static system prompt is 10,000 tokens and it always appears first and unchanged, those 10,000 tokens are written to the cache once and then billed at the much lower cache-read rate on every subsequent call in that cache window, rather than at the full input price.
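
The layering can be sketched as a small request builder. This is an illustrative assumption about how you might organize the code, not an SDK feature: build_request is a hypothetical helper that returns keyword arguments suitable for client.messages.create, and it places the instructions and the retrieved documents in two separate system blocks so each can carry its own cache breakpoint.

```python
def build_request(static_instructions, documents, history, user_message,
                  model="claude-haiku-4-5", max_tokens=1024):
    """Assemble request kwargs with the least-changing content first."""
    system = [
        # Layer 1: identical on every request.
        {"type": "text", "text": static_instructions,
         "cache_control": {"type": "ephemeral"}},
    ]
    if documents:
        # Layer 2: stable within a topic or session.
        system.append({"type": "text", "text": "\n\n".join(documents),
                       "cache_control": {"type": "ephemeral"}})
    # Layer 3 then layer 4: history grows, the new message changes every time.
    messages = list(history) + [{"role": "user", "content": user_message}]
    return {"model": model, "max_tokens": max_tokens,
            "system": system, "messages": messages}


first = build_request("STATIC RULES", ["doc A", "doc B"], [], "First question")
second = build_request("STATIC RULES", ["doc A", "doc B"], [], "Second question")
print(first["system"] == second["system"])  # True: the cacheable prefix is unchanged
```

Because the user message is the only argument that differs between the two calls, everything in the system list is identical and therefore eligible for reuse.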

A practical example using the Anthropic Python SDK with Claude looks like this:

import anthropic

client = anthropic.Anthropic()

SYSTEM_INSTRUCTIONS = """
You are a technical support assistant for Acme Corp.
[... thousands of tokens of policy docs, troubleshooting trees,
product documentation, and behavioral guidelines ...]
"""

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "My printer is showing error code E-42."}
    ]
)

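The cache_control field marks the end of that block as a cache breakpoint: everything in the prompt up to and including it becomes a cacheable prefix. On the first call the full system prompt is processed and written to the cache. On subsequent calls within the cache window (five minutes by default, refreshed each time the cached prefix is read), you pay full price only for the new user message tokens, plus a much smaller cache read fee for the prefix. Note that prefixes shorter than a model-specific minimum length are not cached at all, and the API does not raise an error when that happens.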
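
A rough cost model shows why this matters. The sketch below is arithmetic only, under stated assumptions: the write and read multipliers (1.25x and 0.10x of the base input price) and the price of $1 per million tokens are placeholders to be replaced with figures from the current pricing page, and input_cost is a hypothetical helper that assumes every request after the first hits the cache.

```python
def input_cost(prefix_tokens, suffix_tokens, requests, price_per_mtok,
               write_multiplier=1.25, read_multiplier=0.10, cached=True):
    """Estimate total input cost in dollars across a run of requests.

    Assumes every request after the first hits the cache (no expiry).
    The multipliers are placeholder assumptions, not quoted prices.
    """
    if requests < 1:
        return 0.0
    per_token = price_per_mtok / 1_000_000
    if not cached:
        return requests * (prefix_tokens + suffix_tokens) * per_token
    write = prefix_tokens * write_multiplier * per_token        # first request
    reads = (requests - 1) * prefix_tokens * read_multiplier * per_token
    suffix = requests * suffix_tokens * per_token               # never cached
    return write + reads + suffix


# Hypothetical workload: 10,000-token prefix, 200-token question, 1,000 requests.
uncached = input_cost(10_000, 200, 1_000, price_per_mtok=1.0, cached=False)
cached = input_cost(10_000, 200, 1_000, price_per_mtok=1.0)
print(f"uncached: ${uncached:.2f}  cached: ${cached:.2f}")
```

Under these assumptions the first request costs slightly more than an uncached one because of the write premium, and the saving appears from the second request onward.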

Using Adaptive Thinking Without Breaking Your Cache

Claude models at version 4.6 and above support adaptive thinking, which you enable by passing {"type": "adaptive"} in the thinking configuration. The old fixed budget_tokens parameter is no longer used on these models.

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=8000,
    thinking={"type": "adaptive"},
    system=[
        {
            "type": "text",
            "text": SYSTEM_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Analyze the performance tradeoffs in this architecture..."}
    ]
)

Thinking tokens themselves are not cached in the same way as prompt tokens. Structure your prompt so the large static blocks are still positioned before the dynamic content, and adaptive thinking will operate on top of a warm cache without forcing you to pay full input token prices on every reasoning-heavy request.

The Silent Mistakes That Disable Caching

These are the failure modes that produce no error, no warning, and no obvious sign anything is wrong. Your bill simply never comes down.

Injecting dynamic content into the prefix

The most common mistake is inserting a timestamp, request ID, user name, or session token at the top of the system prompt. It feels harmless:

from datetime import datetime

# This destroys your cache hit rate: request_id and the timestamp differ on every call
system_prompt = f"Request ID: {request_id}\nTimestamp: {datetime.now()}\n\n{STATIC_INSTRUCTIONS}"

Every request now has a unique prefix. Zero cache hits. Move all dynamic context to the end of the prompt or into the user message turn.
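
One way to catch this before it reaches production is a simple lint check on the system prompt. The following is a heuristic sketch, not a complete detector: the two regular expressions and the find_volatile_content helper are assumptions chosen for illustration, and they will miss other kinds of per-request values.

```python
import re

# Heuristic patterns for values that usually change per request
# (illustrative, not exhaustive).
VOLATILE_PATTERNS = {
    "iso_timestamp": re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}"),
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
        r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
}


def find_volatile_content(prompt: str) -> list[str]:
    """Return the names of any volatile patterns found in the prompt."""
    return [name for name, pattern in VOLATILE_PATTERNS.items()
            if pattern.search(prompt)]


clean = "You are a technical support assistant for Acme Corp."
dirty = "Timestamp: 2025-01-15 09:30:12.123456\n\n" + clean

print(find_volatile_content(clean))  # []
print(find_volatile_content(dirty))  # ['iso_timestamp']
```

A check like this could run in a unit test or at service startup, failing fast when someone adds a timestamp or identifier to the static prompt.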

Rebuilding the prompt object on every request

If your code constructs the system prompt string fresh on every call and the source data has changed at all, even in whitespace or punctuation, the cache key will differ. Pin your static prompt content to a versioned constant or load it from a file at startup, not at request time.
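
A lightweight guard is to fingerprint the pinned prompt once and compare against it before sending. This is a sketch under assumptions: fingerprint and assert_prompt_unchanged are hypothetical helpers, and the inline string stands in for content that a real service would read once from a versioned file at startup.

```python
import hashlib


def fingerprint(prompt: str) -> str:
    """Return a short, stable hash of the exact prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


# In a real service this string would be read once from a versioned file at startup.
STATIC_PROMPT = "You are a technical support assistant for Acme Corp.\n"
STATIC_PROMPT_FINGERPRINT = fingerprint(STATIC_PROMPT)


def assert_prompt_unchanged(prompt: str) -> None:
    """Raise if the prompt about to be sent differs from the pinned version."""
    if fingerprint(prompt) != STATIC_PROMPT_FINGERPRINT:
        raise ValueError("static prompt drifted from the pinned version")


assert_prompt_unchanged(STATIC_PROMPT)              # passes silently
try:
    assert_prompt_unchanged(STATIC_PROMPT.strip())  # trailing newline removed
except ValueError as exc:
    print(exc)  # static prompt drifted from the pinned version
```

Logging the fingerprint alongside each request also makes it easy to correlate a drop in cache reads with a prompt change.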

Randomized or shuffled retrieval chunks

Retrieval-augmented generation pipelines often return documents in a non-deterministic order. If your retrieved context changes order between calls even when the underlying documents are the same, the prefix after your system instructions will be different every time. Sort your retrieved chunks by a stable key such as document ID before inserting them into the prompt. If you prefer to order by relevance, round the score to a fixed precision and break ties by document ID so that chunks with equal scores cannot swap places between calls.
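
A minimal sketch of deterministic ordering follows. The chunk fields doc_id, chunk_index, and text are assumed names for illustration; substitute whatever your retriever actually returns.

```python
def order_chunks(chunks: list[dict]) -> list[dict]:
    """Sort retrieved chunks by document ID, then by position within the document."""
    return sorted(chunks, key=lambda c: (c["doc_id"], c["chunk_index"]))


def render_context(chunks: list[dict]) -> str:
    """Join chunks into one context string in a deterministic order."""
    return "\n\n".join(c["text"] for c in order_chunks(chunks))


# The same three chunks, returned by the retriever in two different orders.
run_1 = [
    {"doc_id": "kb-7", "chunk_index": 0, "text": "Resetting the device"},
    {"doc_id": "kb-2", "chunk_index": 1, "text": "Error code table"},
    {"doc_id": "kb-2", "chunk_index": 0, "text": "Printer overview"},
]
run_2 = [run_1[2], run_1[0], run_1[1]]

print(render_context(run_1) == render_context(run_2))  # True
```

The rendered context is identical regardless of the order the retriever produced, so the prefix stays stable as long as the set of chunks is the same.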

Ignoring context window size differences

Claude Haiku 4.5 has a 200K-token context window. Claude Opus 4.8, Sonnet 4.6, and Fable 5 each have a 1M-token window. If you are working with very large cached prefixes, verify that your total prompt length including the cache block fits comfortably within the target model’s context limit. A prompt designed for a 1M-window model will fail or behave unexpectedly if you switch to Haiku without checking your token counts.
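
A pre-flight check along these lines can catch the problem before a model switch reaches production. It is a sketch with assumed names: fits_context is a hypothetical helper, the prompt token count is passed in as a number (in practice it would come from a tokenizer or a token-counting call), and the 10,000-token headroom is an arbitrary safety margin.

```python
def fits_context(prompt_tokens: int, max_output_tokens: int,
                 context_window: int, headroom_tokens: int = 10_000) -> bool:
    """Return True if prompt plus reserved output fits with some headroom."""
    usable = context_window - headroom_tokens
    return prompt_tokens + max_output_tokens <= usable


# A 350,000-token prompt fits a 1M window but not a 200K one.
print(fits_context(350_000, 4_096, context_window=1_000_000))  # True
print(fits_context(350_000, 4_096, context_window=200_000))    # False
```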

Caching Conversation History in Multi-Turn Applications

In a long chat session, the conversation history grows with every turn. You can apply a cache breakpoint to the conversation history up to the most recent exchange, then append the new user message after it.

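messages = [
    # Earlier turns, sent unchanged so the prefix stays identical across requests
    *[
        {"role": turn["role"], "content": [{"type": "text", "text": turn["content"]}]}
        for turn in previous_turns[:-1]
    ],
    # Most recent previous turn carries the single cache breakpoint,
    # which makes the whole history up to this point cacheable
    *[
        {"role": turn["role"], "content": [
            {"type": "text", "text": turn["content"],
             "cache_control": {"type": "ephemeral"}}
        ]}
        for turn in previous_turns[-1:]  # cache breakpoint on last assistant turn
    ],
    # New user message, not cached yet
    {"role": "user", "content": new_user_message}
]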

The strategy here is to push the cache breakpoint as close to the current turn as possible, so the model reads the accumulated history from cache and only processes the new message fresh. This matters most in long agentic sessions where context grows to tens of thousands of tokens.

Measuring Whether Caching Is Working

The Anthropic API returns cache-related token counts in the response usage object. Inspect these fields in your integration:

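cache_creation_input_tokens: tokens written to cache on this request (first occurrence, higher cost)
cache_read_input_tokens: tokens read from cache (lower cost)
input_tokens: tokens that were neither read from nor written to the cache, typically everything after the last cache breakpoint (full price)

The total input for a request is the sum of all three fields.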

Log these values per request in your observability layer. If cache_read_input_tokens is consistently zero on requests that should be hitting your static system prompt, work through the silent-mistake checklist above. Most of the time the culprit is dynamic content injected into the prefix.
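
The per-request logging can be sketched as a small summary function. cache_stats is a hypothetical helper, and the SimpleNamespace object below is a stand-in for response.usage with made-up numbers; only the three field names come from the API.

```python
from types import SimpleNamespace


def cache_stats(usage) -> dict:
    """Summarize the cache-related fields of a response usage object."""
    # getattr with a default, plus "or 0", tolerates missing or None fields.
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0
    total = created + read + uncached
    return {
        "cache_write_tokens": created,
        "cache_read_tokens": read,
        "uncached_tokens": uncached,
        "cache_hit_ratio": read / total if total else 0.0,
    }


# Stand-in for response.usage from a warm-cache request (numbers are made up).
usage = SimpleNamespace(cache_creation_input_tokens=0,
                        cache_read_input_tokens=9_000,
                        input_tokens=1_000)
print(cache_stats(usage)["cache_hit_ratio"])  # 0.9
```

Emitting the returned dictionary as structured log fields lets you chart the hit ratio over time and alert when it drops toward zero.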

Choosing the Right Model for Cached Workloads

The economics of caching interact with model pricing. For high-volume workloads where the cached prefix is large and the per-turn query is short, a smaller model like Claude Haiku 4.5 will often be the right choice: the cache read cost is lower in absolute terms, and the 200K context window is sufficient for most RAG pipelines. Reserve Claude Sonnet 4.6 or Claude Fable 5 for tasks that genuinely require higher reasoning capability, where the cache savings help offset the higher base cost.

Takeaway

Prompt caching is not a configuration knob you flip once and forget. It is a design constraint that should shape your prompt architecture from the start. Keep stable content at the front, dynamic content at the back, sort your retrieval results deterministically, and instrument your token usage so you can verify the cache is actually being hit. Get those fundamentals right and the cost reductions will follow without any changes to model selection or infrastructure.