Your documentation assistant ships quietly and works, answering how to rotate an API key, what the rate limits are, and how to invite a teammate, correctly and in seconds. The week the invoice lands, you pull the logs and find that about twenty questions account for most of the traffic, each asked hundreds of times in nearly the same words by people who will never meet. The questions are public, the answers barely changed all month, and every asker paid for the full run: the same system rules read again, the same pages retrieved again, the same answer produced again from scratch. The bill reads as if every asker were the first, and almost none of them were.
The three caches: prompt, response, and semantic
Caching means storing work you have already paid for and serving it again when the same need returns. AI products use it in three forms, and each form saves a different kind of repetition.
Prompt caching saves re-reading. Most of a production prompt never changes between calls: the system rules, the tool definitions, and the shared documents ride at the front of every request, while only the user's question changes at the tail. With prompt caching, the provider stores that unchanging front after the first call and reprocesses only the new tail, and tokens read from the cache bill at a small fraction of the normal input rate. The exact ratios move with provider pricing, so the current numbers live in the dated Price Sheet; the durable fact is that the discount on the cached part is large, and on long prompts the cached part is most of the input bill.
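To see why the cached front dominates the bill on long prompts, it helps to run the arithmetic once. The sketch below is illustrative only: the function name, the token counts, the rate, and the 10 percent cache-read multiplier are all placeholders invented for the example, not any provider's prices (those live in the dated Price Sheet). It also ignores any extra charge a provider may apply when the cache is first written.

```python
def input_cost(prefix_tokens, tail_tokens, rate_per_mtok, cached_multiplier=None):
    """Input cost of one call, in dollars.

    cached_multiplier is the fraction of the normal input rate billed for
    tokens read from the cache; None means the prefix is not cached.
    All values passed in below are hypothetical.
    """
    if cached_multiplier is None:
        prefix_rate = rate_per_mtok
    else:
        prefix_rate = rate_per_mtok * cached_multiplier
    return (prefix_tokens * prefix_rate + tail_tokens * rate_per_mtok) / 1_000_000


# Placeholder workload: 8,000 tokens of rules, tools, and shared pages at the
# front, a 200-token question at the tail, $3 per million input tokens.
uncached = input_cost(8_000, 200, rate_per_mtok=3.0)
cached = input_cost(8_000, 200, rate_per_mtok=3.0, cached_multiplier=0.10)
savings = 1 - cached / uncached  # share of the input bill removed by the cache
```

With these made-up numbers the cached call costs roughly an eighth of the uncached one on input. Change the multiplier and the prefix length to match your own prompts and the current Price Sheet before drawing any conclusion.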
Response caching saves re-answering. When the same question arrives again, in exactly the same words and with nothing personal in it, you serve the stored answer and make no model call at all. The discount is total, and so is the constraint, because it fires only on exact repeats.
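A minimal sketch of that exact-match behavior follows. It assumes the caller has already decided the question is public and contains nothing personal; `ResponseCache`, `answer`, and the `call_model` argument are hypothetical names for this example, and `call_model` stands in for whatever function actually pays for a model call.

```python
import hashlib


class ResponseCache:
    """Exact-match cache: a hit requires the identical question string."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(question):
        # Hash the exact text, so any change in wording yields a different key.
        return hashlib.sha256(question.encode("utf-8")).hexdigest()

    def get(self, question):
        return self._store.get(self._key(question))

    def put(self, question, answer_text):
        self._store[self._key(question)] = answer_text


def answer(question, cache, call_model):
    """Serve a stored answer on an exact repeat, otherwise pay for a fresh one."""
    cached = cache.get(question)
    if cached is not None:
        return cached, "hit"       # no model call at all
    fresh = call_model(question)   # the only place tokens are spent
    cache.put(question, fresh)
    return fresh, "miss"
```

Asking the same string twice produces one model call and one hit. Changing the capitalization or dropping the question mark produces a second model call, which is the constraint described above.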
Semantic caching saves re-answering for near-duplicates. "How do I reset my password" and "how can I change my password if I forgot it" are different strings asking for the same answer. A semantic cache matches an incoming question to stored ones by meaning, usually by comparing embeddings (numeric fingerprints of what a text is about), and serves the stored answer when the match is close enough. It hits far more often than exact matching, and it is the riskiest of the three, because "close enough" is a judgment call: "how do I enable notifications" and "how do I disable notifications" sit almost on top of each other by meaning and deserve opposite answers.
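The matching step can be sketched in a few lines. Everything here is an assumption made for illustration: `toy_embed` is a crude bag-of-words stand-in for a real embedding model, the cache scans its entries linearly where a production system would likely use a vector index, and the class and function names are hypothetical.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold, embed=toy_embed):
        self.threshold = threshold  # the "close enough" judgment call
        self.embed = embed
        self.entries = []           # list of (embedding, answer) pairs

    def put(self, question, answer):
        self.entries.append((self.embed(question), answer))

    def get(self, question):
        """Return the closest stored answer if it clears the threshold."""
        query = self.embed(question)
        best_score, best_answer = 0.0, None
        for embedding, stored_answer in self.entries:
            score = cosine(query, embedding)
            if score > best_score:
                best_score, best_answer = score, stored_answer
        return best_answer if best_score >= self.threshold else None


same_intent = cosine(toy_embed("How do I reset my password"),
                     toy_embed("how can I change my password if I forgot it"))
opposite = cosine(toy_embed("how do I enable notifications"),
                  toy_embed("how do I disable notifications"))
```

With this crude stand-in, the enable/disable pair scores higher than the two password questions, so any threshold loose enough to match the password rewording also serves the wrong notifications answer. Real embedding models behave better than word counts, but the same kind of collision is what the threshold and the spot checks exist to catch.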
The documentation assistant in the opening scene earns all three at once: prompt caching on the rules and pages every question shares, response caching on the twenty questions that arrive verbatim, and semantic caching on their reworded cousins.
What you can cache is a product decision
Which caches your product earns depends on what your users ask and how personal the answers are. Content caches well when it is stable and shared: the product's own rules, public documentation, policy text, anything where every asker deserves the same answer. Content refuses to cache when it is any of three things:
- Personalized. An email assistant drafting from one user's inbox, or a meeting tool summarizing one team's call, produces answers no other user should ever receive.
- Time-sensitive. A research tool reporting on this morning's filings, or a support bot quoting current wait times, answers questions whose correct answer moves faster than any sensible cache lifetime.
- Memory-dependent. An answer built on what the user said three turns ago belongs to that session and no other.
This is why prompt order matters. Providers cache only the unbroken, unchanging front of a request, so the first personalized token ends the discount for everything after it. Put the stable content first: every volatile token you move toward the tail extends the discount.
This is the same discipline as the split in "Context budgets: fit the right facts into a finite window," which divided the window into a fixed portion and a variable portion. Caching pays you for keeping the fixed portion first and genuinely fixed.
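The effect of ordering is easy to see by comparing how much of the front two requests share. The sketch below is a rough illustration: it splits on whitespace as a stand-in for a real tokenizer, the prompt text and function names are invented, and real providers match on their own tokens and may apply a minimum cacheable length or cache in blocks.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_prompt(stable_parts, volatile_parts, stable_first=True):
    """Join prompt parts and split on whitespace (a crude tokenizer stand-in)."""
    if stable_first:
        parts = stable_parts + volatile_parts
    else:
        parts = volatile_parts + stable_parts
    return " ".join(parts).split()


# Hypothetical prompt content for two different askers.
RULES = "You are the documentation assistant. Answer only from the pages provided."
PAGES = "PAGE: API keys are rotated from the Settings screen."
ana = ["Ana asks:", "how do I rotate a key?"]
ben = ["Ben asks:", "what are the rate limits?"]

# Stable content first: both requests share the whole fixed front.
good = shared_prefix_len(build_prompt([RULES, PAGES], ana),
                         build_prompt([RULES, PAGES], ben))

# Personal content first: the requests diverge at the very first token.
bad = shared_prefix_len(build_prompt([RULES, PAGES], ana, stable_first=False),
                        build_prompt([RULES, PAGES], ben, stable_first=False))
```

Here `good` covers every token of the rules and pages, while `bad` is zero: the same content, reordered, shares no front at all and would earn no prompt-cache discount.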
Every cached answer is a freshness decision
A cache is a freshness decision wearing a discount: what you never pay for twice is also what you never re-check.
A stored answer was correct on the day it was stored, and serving it again is a bet that nothing relevant has changed since. Usually that bet is safe, which is why caching works at all, but when it is wrong the failure is quiet: a response cache in front of a support bot keeps serving last quarter's refund policy in the bot's usual confident register, and nothing in the answer signals its age.
In "Freshness and conflicts: govern the knowledge you answer from," you wrote freshness SLAs, the promises about how quickly each kind of answer reflects a change in the world. A cache lifetime is where those promises get enforced or broken. If policy answers must reflect changes within a day, no cached policy answer may live longer than a day, and when a source document changes, the answers built on it should be removed immediately rather than left to expire on schedule.
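Both rules, a lifetime taken from the SLA and immediate removal when a source changes, fit in one small structure. This is a sketch under stated assumptions: `FreshnessCache`, its method names, and the idea of tagging each entry with the identifiers of its source documents are illustrative choices, and it keeps everything in process memory where a real deployment would likely use a shared store.

```python
import time


class FreshnessCache:
    """Entries expire on a lifetime and are evicted when a source changes."""

    def __init__(self, clock=time.time):
        self._clock = clock   # injectable so lifetimes can be tested
        self._entries = {}    # key -> (answer, expires_at, source_ids)

    def put(self, key, answer, ttl_seconds, source_ids):
        # ttl_seconds should come from the freshness SLA of the content.
        expires_at = self._clock() + ttl_seconds
        self._entries[key] = (answer, expires_at, frozenset(source_ids))

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        answer, expires_at, _ = entry
        if self._clock() >= expires_at:
            del self._entries[key]   # lifetime exceeded: re-check the world
            return None
        return answer

    def invalidate_source(self, source_id):
        """Evict every answer built on a source document that just changed."""
        stale = [key for key, (_, _, sources) in self._entries.items()
                 if source_id in sources]
        for key in stale:
            del self._entries[key]
        return len(stale)
```

A policy answer stored with `ttl_seconds=86_400` can never be served for more than a day, and calling `invalidate_source` from whatever process publishes the refund policy removes the old answer the moment the document changes instead of waiting for that day to run out.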
Measure the repetition before you build any of it
A cache only pays when traffic repeats, and the measure of repetition is the hit rate, the fraction of requests served from the cache instead of paid for fresh. You can know this number in advance: a week of production logs tells you how much of your workload repeats before you build anything, and a workload with no repetition caches nothing.
The three caches read the same logs differently. Prompt caching hits on nearly every call as long as requests share their front, so almost any production workload earns it. Response caching hits only where the logs show verbatim repeats, which is common for public documentation questions and rare for personal assistants. Semantic caching lands somewhere between the two, and where it lands depends on a matching threshold you have to choose and then defend with spot checks against wrong-answer matches. A support bot may find that a small set of questions covers half its traffic, while a research assistant may find every query unique and earn prompt caching alone. Both verdicts are useful, and the logs hand you either one before you have built anything.
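For the exact-repeat part of that reading, a log export and a few lines of counting are enough. The sketch assumes you can load the week's questions as a list of strings and that a stored answer would stay valid for the whole window; the function names are hypothetical.

```python
from collections import Counter


def exact_hit_rate(questions):
    """Fraction of requests an exact-match response cache would have served.

    The first occurrence of each question is a miss (it fills the cache);
    every later occurrence of the same string is a hit.
    """
    if not questions:
        return 0.0
    counts = Counter(questions)
    hits = sum(n - 1 for n in counts.values())
    return hits / len(questions)


def top_share(questions, k):
    """Share of all traffic covered by the k most frequent questions."""
    if not questions:
        return 0.0
    counts = Counter(questions)
    return sum(n for _, n in counts.most_common(k)) / len(questions)
```

Calling `top_share(questions, 20)` on a week of logs tells you whether your workload looks like the opening scene, and `exact_hit_rate(questions)` is the response-cache hit rate you could expect if entries never expired within the window. Shorter lifetimes will lower it.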
Try it now
This drill spends no new tokens, and it produces the verdict this chapter is about: which caches your workload actually earns.
Pull a week of real prompts. Export a week of requests from your product's logs, or, if you have no production traffic yet, use a week of your own assistant history as a stand-in. Scale it down: one busy day, or the most recent hundred requests, keeps the sorting to minutes and still reveals the pattern.
Sort them into three piles. By eye, mark each request as an exact repeat of another (the words match), a near-repeat (different words, same intended answer), or a one-off. Sorting by eye is the honest version of what a semantic cache will do automatically, and the requests you hesitate over are exactly the ones a cache will get wrong.
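Estimate the hit rate for each cache. Exact repeats divided by total requests approximates your response-cache hit rate. Exact plus near-repeats, divided by the same total, is the ceiling for a semantic cache. The fraction of each prompt that is fixed front matter (rules, tools, shared documents) approximates what prompt caching would discount on every call. The first two figures run slightly high, because the first arrival of each repeated question still has to be answered fresh before anything can be stored.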
Write the verdict with lifetimes. In a few lines, record which of the three caches this workload earns and, for each one you would build, the lifetime you would give its entries, taken from the freshness SLA of the content behind them. Keep the note, because it becomes a line in the budget you assemble in "Write your Inference Budget and ship a feature that pays for itself."
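If you would rather not do the drill's division by hand, the three estimates reduce to one small function. It follows the drill's own definitions; the function name, the dictionary keys, and the sample counts in the note below are invented for illustration.

```python
def drill_estimates(exact, near, one_off, fixed_tokens, variable_tokens):
    """Turn the three piles and a typical prompt split into the drill's estimates.

    exact, near, one_off: request counts from sorting by eye.
    fixed_tokens, variable_tokens: rough size of a typical prompt's fixed
    front matter and of its changing tail.
    """
    total = exact + near + one_off
    prompt_tokens = fixed_tokens + variable_tokens
    if total == 0 or prompt_tokens == 0:
        raise ValueError("need at least one request and a non-empty prompt")
    return {
        "response_cache_hit_rate": exact / total,
        "semantic_cache_ceiling": (exact + near) / total,
        "prompt_fixed_fraction": fixed_tokens / prompt_tokens,
    }
```

For example, a hundred requests sorted into 35 exact repeats, 25 near-repeats, and 40 one-offs, with a prompt of about 8,000 fixed tokens and 200 variable ones, gives a response-cache estimate of 0.35, a semantic ceiling of 0.60, and a fixed fraction of roughly 0.98. Those counts are made up; yours come from the piles.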
Chapter Summary
- Caching stores work you already paid for and serves it again when the same need returns, and AI products use three kinds of cache.
- Prompt caching stores the unchanging front of your request with the provider, so repeat reads of rules and shared documents bill at a small fraction of the normal input rate.
- Response caching serves a stored answer to an exact repeat of a question and skips the model call entirely.
- Semantic caching matches near-duplicate questions by meaning and shares one answer among them; it hits the most and risks the most, because "close enough" is a judgment.
- Stable, shared content caches well; personalized, time-sensitive, and memory-dependent content does not.
- Order every prompt with the stable content first, because the first changed token ends the cached discount for everything after it.
- Every cached answer is a freshness decision, so give each entry a lifetime derived from your freshness SLAs and evict entries the moment their source changes.
- Measure the hit rate from real logs before building, because a workload with no repetition caches nothing.
- Caching discounts the work that repeats; the next discount pays you for work that can wait, which is where "Batch and background: choose the realtime line" picks up.
Sources
- Anthropic and OpenAI prompt caching documentation (last verified July 2026).
- Bang, F. (2023). GPTCache: An Open-Source Semantic Cache for LLM Applications. Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS), EMNLP.
- Character.AI engineering blog (2024). Optimizing AI Inference at Character.AI.
- Current cached-token rates and discounts: the dated Price Sheet.