Context budgets: fit the right facts into a finite window · The Builder's Stack

What it remembersKeep the facts fresh

The support assistant answered well at launch, so the team kept feeding it. Someone added the full product wiki to every request, then the customer's complete ticket history, then all twenty retrieved passages instead of the top three. Each addition felt like insurance. One afternoon a customer asks about the refund window, and the reply quotes a policy retired a year ago, pulled from deep in the wiki dump while the current policy sits unused in the middle. The quality dashboard, flat for months, has been sliding for three weeks, and the invoice for the same month arrives at triple the usual number. Nobody can say which addition did the damage, because every one of them was added to help and none was ever removed.

The model processes every token you send, and you pay for all of it

Every request your product makes is one block of text: the system rules, whatever the product injects from its stored memory, the retrieved passages, and the conversation so far. The model processes the whole block on every call, and the provider bills the whole block on every call, whether or not a given passage had anything to do with the answer. There is no skimming tier and no refund for the parts that turned out not to matter.

That pricing would be a footnote if more context reliably produced better answers, but it does not. Context helps until the facts the answer needs are present, and past that point every additional token is weight the model has to process and money you have to spend. Give the model the facts it wasn't trained on made the case for handing over what training never covered; this chapter makes the case for refusing to hand over everything else.

The window is a budget, not a warehouse: past the point where the right facts are present, every extra token costs money and quality at once.

Why a full window produces worse answers

Published research explains the sliding dashboard in the opening scene.

Models handle the middle of a long input worst. In a widely replicated 2023 finding, models answered most accurately when the relevant passage sat at the very start or very end of a long input, and markedly worse when it sat in the middle. Dump a whole wiki into the window and the fact you need usually lands in exactly that middle region.
Near-relevant filler drags accuracy down. A 2025 study measured what practitioners had been reporting: as a window fills with material related to the topic but not needed for the question, accuracy falls even on simple lookups the same model completes cleanly with a short input. The industry calls this context rot, meaning quality decay caused by the sheer volume in the window rather than by anything wrong with the facts themselves.

The practical rule follows directly: a retrieval step that selects the right passages beats a window that hoards everything. The lookup you built in Retrieval: let the product look things up before it answers exists so the window can hold the three passages a question needs instead of the three hundred that might matter someday, and the three-passage version answers better and costs a fraction as much.

Compaction: replace old history with what mattered

Even with disciplined retrieval, long sessions fill windows on their own, because history arrives one turn at a time and never leaves by itself. Compaction is the standard countermeasure: the product summarizes the older turns into a short digest, injects the digest in place of the full transcript, and keeps only the recent turns verbatim.

The digest is produced by a model call, but what it preserves should never be left to a model default. Decide it as a product requirement:

Decisions survive. "The customer chose the annual plan" must outlive the twelve turns that led to it.
Constraints survive. "This account tier has no refunds past 30 days" is a fact every later turn depends on.
Open questions survive. "Still waiting on the invoice number" is the difference between a thread that picks up cleanly and one that asks again.
Chit-chat, dead ends, and resolved back-and-forth do not survive. They earned their tokens the first time and never again.

What survives compaction is a product decision: decisions, constraints, and open questions stay, and the conversation that produced them goes.

For timing, order-of-magnitude judgment is enough. Compacting every turn pays for a summary call each time and discards detail too early, while waiting until the window is nearly full means answers have already degraded, so a workable default is to compact once the history crosses about half of its budget line and to cut it to roughly a tenth when you do. Compaction manages a single session; what deserves to persist across sessions is the separate decision you made in Memory: decide what your product remembers.

Write the budget down as a per-request split

Everything above turns operational the moment you write it as numbers. Give every request a total token budget and split it across the four things competing for the window: system rules, memory, retrieved facts, and history. A sensible starting split keeps system rules on a small fixed line, memory on a small line, retrieved facts on the largest flexible line, and history on whatever remains.

The exact percentages matter less than what you decide around them. First, decide who gets squeezed when the lines collide: history compacts first, retrieved passages trim from many to few second, memory trims third, and system rules never get cut at runtime, because rules that no longer fit are a design problem to fix, not a line to shave silently. Second, write the split down where the team can inspect it, in the repo next to the prompt, so that when quality drops you can see which line grew last month instead of guessing.

Every stuffed token is billed on every call

The cost coupling is easy to miss because it multiplies quietly. Context is re-sent with every message, so a forty-turn conversation carrying a stuffed window pays for the stuffing forty times, which is how the opening scene's invoice tripled without any increase in traffic.

Providers soften the repetition with prompt caching, a discount on the unchanged opening portion of a request, usually the system rules and other stable context; at volume the discount is real money, and the current rates and cache rules live in the Retrieval Stack Sheet because they change too often for chapter prose. Caching lowers the price of a bloated window, though, without touching the quality loss. The cheapest token, and the only one with no quality cost, is still the one you never send, and if you later run agent fleets, this same multiplication picks up a further multiplier per agent, priced in Economics: what a fleet costs and when it pays.

Try it now

This drill takes about 30 minutes and produces your own before-and-after evidence for the budget.

Get your unit. Pick one long transcript or document thread you own: an exported project discussion, a long-running support ticket, a working doc with weeks of edits. A few thousand words is enough.

Ask five questions against the full thing. Write five questions the thread can answer and note your own answers first. Then paste the entire thread into a model, ask each question, save the answers, and record the request size or cost from your usage dashboard.

Compact it by hand. Write a half-page summary that keeps only decisions, constraints, and open questions (drafting it with the model and editing the draft is fine). Then pick the two excerpts from the thread most relevant to your questions.

Ask the same five against the compact version. Same model, same questions, but the input is your summary plus the two excerpts. Save the answers and the cost again.

Score both sets blind. Shuffle the ten answers so you cannot tell which run produced which, grade them against your own answers, then unshuffle and put the two scores next to the two costs. Whichever way it comes out, you now hold the comparison as your own evidence instead of a claim from us.

Scale it down: two questions and one excerpt give you the same comparison in about ten minutes.

Chapter Summary

Every request is one block of text, the model processes all of it, and you are billed for all of it on every call.
More context helps only until the facts the answer needs are present; past that point, every extra token costs quality and money at once.
Models handle the middle of long inputs worst, and windows filled with near-relevant material degrade answers, the failure the industry calls context rot.
A retrieval step that selects a few relevant passages beats a window stuffed with everything that might matter someday.
Compaction replaces old history with a short digest, and what the digest keeps is a product decision: decisions, constraints, and open questions stay, small talk goes.
Give every request a written budget split across system rules, memory, retrieved facts, and history, decide in advance who gets squeezed first, and keep the split where the team can inspect it.
Prompt caching discounts repeated context but does not repair the quality loss; the token you never send is still the cheapest.
A budget fills the window with the right facts, but only if those facts are still right, which is the question Freshness and conflicts: govern the knowledge you answer from takes up next.

Sources

Liu, N. F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics.
Chroma (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma technical report.
Anthropic and OpenAI prompt caching and long-context documentation (last verified July 2026).

Marks this chapter complete on your course map. Reaching the end does this for you.