The quarterly review reaches the AI roadmap, and someone asks the simple question: which of these features make money? The room is not short of evidence. Quality dashboards sit above their bars, latency graphs run comfortably inside their targets, and the adoption curves bend the right way. But no one can put a margin next to a feature name, because finance holds one invoice for the whole product and engineering holds token counts nobody has tied to revenue. So the discussion slides to conviction. The feature that quietly loses money on every use happens to be one people love, and it survives another quarter, its bill still addressed to the whole roadmap.
"Why good AI features die of bad unit economics" opened this part with the feature that dies of its own margin, and every chapter since has settled one piece of the defense, from what a task costs to what the user pays. Each of those decisions lives wherever it was made, in a dashboard, a routing config, or a pricing memo, and this chapter moves them onto one signed page.
One page that puts a margin next to the feature
The Inference Budget makes a feature's economics someone's signed responsibility: a cost target, a latency budget, a routing bar, and alarms with names on them.
The budget ships as a fillable PDF, written per feature rather than per product, because margin lives at the level of one feature and dies there too. Its six sections each compress one chapter of this part into fields you can complete in an afternoon, and the page ends with alarms and a signature. Filled in, it answers the review's question in writing, and the rest of this chapter walks it section by section.
Name the unit and set the latency budget
The unit and its target. The first section carries the receipt from "The bill of materials: cost the task, not the call": the task with its completion marker (the email sent, the ticket resolved), the per-task cost today from real logs rather than a flattering per-call figure, and the target that cost must reach, with a date next to it. Set the target from what the task earns rather than from ambition; a target without a date is a wish.
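The gap between a per-call figure and a per-task receipt can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the log shape, the `task_id` and `cost_usd` fields, and every number are hypothetical, and dividing all spend (retries and abandoned work included) by completed tasks is one reasonable assumption about how to build the receipt.

```python
# Hypothetical call log: each model call is tagged with the task it served.
# All costs are illustrative, not real prices.
calls = [
    {"task_id": "t1", "cost_usd": 0.004},
    {"task_id": "t1", "cost_usd": 0.012},  # a retry with a larger context
    {"task_id": "t1", "cost_usd": 0.004},
    {"task_id": "t2", "cost_usd": 0.004},
    {"task_id": "t3", "cost_usd": 0.006},  # abandoned, never completed
]
completed = {"t1", "t2"}  # tasks that reached the completion marker


def per_call_cost(calls):
    """The flattering figure: average spend per model call."""
    return sum(c["cost_usd"] for c in calls) / len(calls)


def per_task_cost(calls, completed):
    """Total spend, including wasted work, per completed task."""
    if not completed:
        raise ValueError("no completed tasks in the window")
    return sum(c["cost_usd"] for c in calls) / len(completed)


print(f"per call: ${per_call_cost(calls):.4f}")
print(f"per task: ${per_task_cost(calls, completed):.4f}")
```

In this toy log the per-task figure is two and a half times the per-call figure, and it is the per-task figure that the target and its date are written against.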
The latency budget. The second section carries the two clocks from "Latency: design how fast it feels": a first-token target and a completion target that fit the feature's class, whether chat reply, requested report, or bulk job. Writing them into the budget makes them a constraint on every cost move that follows, because a cascade or a cache that breaks the completion target has saved money by making the feature worse.
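Treating the two clocks as a constraint can be as simple as a check that every proposed cost move has to pass. The sketch below assumes a hypothetical `LatencyBudget` type and made-up targets and measurements; the choice of which percentile to measure is left to the reader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    first_token_s: float   # target time to first token
    completion_s: float    # target time to the finished response

    def allows(self, first_token_s: float, completion_s: float) -> bool:
        """True only if both measured clocks fit inside the budget."""
        return (first_token_s <= self.first_token_s
                and completion_s <= self.completion_s)


# Illustrative targets for a chat-reply class feature (assumed numbers).
chat_budget = LatencyBudget(first_token_s=1.0, completion_s=8.0)

# A hypothetical cascade that is cheaper per task but slower when it escalates.
cascade_measured = {"first_token_s": 0.8, "completion_s": 11.5}
cascade_ok = chat_budget.allows(**cascade_measured)

print("cascade fits the latency budget:", cascade_ok)
```

Here the cascade passes the first-token clock and fails the completion clock, so by the page's own rule it is not a saving yet.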
Record the routing plan and the caching plan
The routing plan. From "Routing and cascades: the cheapest model that passes your bar": the table of traffic lanes and the model serving each one, the escalation gate that promotes a request to the stronger model, and the eval bar that admitted each model to its lane, drawn from the suite you built in "The quality bar: decide what good means." A lane whose model has no eval result next to it is running on somebody's impression, and that field stays empty until the evals have run.
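The routing table and its admission rule can be written down as data. This is a sketch under assumptions: the lane names, the placeholder model names, the scores, and the confidence-based gate are all hypothetical, and a real escalation gate may use a different signal entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lane:
    name: str
    model: str                    # placeholder model names, not real products
    eval_score: Optional[float]   # None until the eval suite has run
    eval_bar: float

    def admitted(self) -> bool:
        """A model holds its lane only with an eval result at or above the bar."""
        return self.eval_score is not None and self.eval_score >= self.eval_bar


lanes = [
    Lane("simple lookups", "small-model", eval_score=0.93, eval_bar=0.90),
    Lane("multi-step drafting", "mid-model", eval_score=None, eval_bar=0.90),
]


def should_escalate(confidence: float, gate: float = 0.7) -> bool:
    """Assumed gate: promote to the stronger model when confidence is low."""
    return confidence < gate


# Lanes that are still running on somebody's impression.
unadmitted = [lane.name for lane in lanes if not lane.admitted()]
print("lanes without a passing eval:", unadmitted)
```

Making `eval_score` optional keeps the empty field visible: a lane with no result cannot pass `admitted()`, which mirrors the rule that the field stays empty until the evals have run.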
The caching plan. From "Caching: never pay for the same thinking twice": the caches this workload has earned, from the prompt prefix every request shares to the answers that repeat across users, each with the lifetime that keeps its contents honest and the hit rate you expect it to hold, a number the alarm section will watch.
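The expected hit rate matters because the saving scales with it. A minimal sketch of that arithmetic follows; the `CachePlan` fields, the lifetime, and the hit and miss costs are illustrative assumptions, not any provider's actual cache pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachePlan:
    name: str
    ttl_seconds: int          # the lifetime that keeps the contents honest
    expected_hit_rate: float  # the number the alarm section will watch


def blended_cost(miss_cost: float, hit_cost: float, hit_rate: float) -> float:
    """Expected cost per request at a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost


prefix_cache = CachePlan("shared prompt prefix", ttl_seconds=300,
                         expected_hit_rate=0.6)

# Illustrative per-request costs in dollars: a miss pays full price.
planned = blended_cost(0.010, 0.002, prefix_cache.expected_hit_rate)
drifted = blended_cost(0.010, 0.002, 0.3)  # the workload moved

print(f"planned: ${planned:.4f}  drifted: ${drifted:.4f}")
```

With these numbers, a hit rate that falls from 0.6 to 0.3 raises the blended cost from $0.0052 to $0.0076 per request, which is the saving that quietly stops arriving.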
Draw the realtime split and take the pricing stance
The realtime split. From "Batch and background: choose the realtime line": every job the feature runs, sorted into realtime work a user watches, background work a user expects soon, and batch work that only has to be done by a deadline, which runs on the discounted tier. The sort is a spending decision dressed as a scheduling one, so any job on the realtime line carries one sentence on who is watching it wait.
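The one-sentence rule for realtime jobs is easy to enforce mechanically once the split is recorded as data. In this sketch the job names and the `watcher` field are hypothetical; the only logic is that a realtime job without a named watcher gets flagged for review.

```python
from enum import Enum


class Tier(Enum):
    REALTIME = "realtime"      # a user is watching it run
    BACKGROUND = "background"  # a user expects it soon
    BATCH = "batch"            # only has to be done by a deadline


# Hypothetical jobs for one feature; "watcher" is the sentence on who waits.
jobs = [
    {"name": "chat reply", "tier": Tier.REALTIME,
     "watcher": "the user, mid-conversation"},
    {"name": "inbox triage", "tier": Tier.BACKGROUND, "watcher": None},
    {"name": "weekly digest", "tier": Tier.BATCH, "watcher": None},
    {"name": "thread summary", "tier": Tier.REALTIME, "watcher": None},
]


def unjustified_realtime(jobs):
    """Realtime jobs that cannot say who is watching them wait."""
    return [j["name"] for j in jobs
            if j["tier"] is Tier.REALTIME and not j["watcher"]]


flagged = unjustified_realtime(jobs)
print("move or justify:", flagged)
```

A flagged job is not necessarily wrong to keep realtime, but it has to earn the line in writing or move to a cheaper tier.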
The pricing stance. From "Pricing: charge in a currency your costs track": the currency the user is charged in and how closely it tracks the tokens spent, the allowance included in each plan, and the margin at the median user and at the heavy user. The per-token rates behind the arithmetic live in the dated Price Sheet rather than on the page, so the budget ages more slowly than the prices do.
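The two margins on the page come from the same small calculation run at two usage levels. The sketch below assumes a flat monthly plan price, a single per-task cost, and an allowance that acts as a hard cap; all three are simplifying assumptions, and every number is illustrative rather than taken from a real price sheet.

```python
from typing import Optional


def monthly_margin(price: float, tasks: int, cost_per_task: float,
                   allowance: Optional[int] = None) -> float:
    """Gross margin as a fraction of plan price for one user-month.

    Assumes the allowance is a hard cap: tasks beyond it are not served
    at the plan price.
    """
    served = tasks if allowance is None else min(tasks, allowance)
    return (price - served * cost_per_task) / price


PRICE = 20.00          # illustrative plan price per month
COST_PER_TASK = 0.02   # illustrative per-task cost from the unit section

median_user = monthly_margin(PRICE, 200, COST_PER_TASK)
heavy_uncapped = monthly_margin(PRICE, 1500, COST_PER_TASK)
heavy_capped = monthly_margin(PRICE, 1500, COST_PER_TASK, allowance=800)

print(f"median user:          {median_user:.0%}")
print(f"heavy user, no cap:   {heavy_uncapped:.0%}")
print(f"heavy user, 800 cap:  {heavy_capped:.0%}")
```

In this example the median user runs at an 80% margin, the uncapped heavy user at minus 50%, and the same heavy user under an 800-task allowance at 20%, which is the spread the pricing stance exists to record.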
Wire the alarms
A budget with no alarms is a forecast, and forecasts do not page anyone. Three numbers from the earlier sections get a threshold and a person's name.
- Per-task cost crossing its threshold means the receipt has drifted: the context got stuffed, retries climbed, or the traffic changed under you.
- Escalation rate climbing means the cascade is converging on flagship prices for everything.
- Cache hit rate falling means the workload moved and a saving you counted on has stopped arriving.
Each threshold is wired like the drift alarms in "Production signals: evals after the ship": a check that runs on live traffic and interrupts a human when it breaks, not a chart someone might open. The person paged is written on the page, because a threshold that notifies "the team" notifies no one.
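The three alarms reduce to a small table of metric, threshold, direction, and owner. The following is a sketch only: the metric names, thresholds, and owners are hypothetical, and the actual paging call is left out because it depends on whatever on-call tooling is in place.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Alarm:
    metric: str
    threshold: float
    breached: Callable[[float, float], bool]  # (live value, threshold)
    owner: str                                # a named person, never "the team"


# Illustrative thresholds and hypothetical owners.
alarms = [
    Alarm("per_task_cost_usd", 0.020, operator.gt, "A. Rivera"),
    Alarm("escalation_rate", 0.25, operator.gt, "A. Rivera"),
    Alarm("cache_hit_rate", 0.50, operator.lt, "J. Chen"),
]


def pages_due(alarms, live):
    """Return (owner, metric) for every alarm its live value has broken."""
    return [(a.owner, a.metric) for a in alarms
            if a.metric in live and a.breached(live[a.metric], a.threshold)]


# A hypothetical reading from live traffic.
live = {"per_task_cost_usd": 0.017, "escalation_rate": 0.31,
        "cache_hit_rate": 0.44}
pages = pages_due(alarms, live)

for owner, metric in pages:
    print(f"page {owner}: {metric} broke its threshold")
```

Note the direction of each comparison: cost and escalation rate alarm when they rise above the threshold, while cache hit rate alarms when it falls below.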
Sign it, review it, and pressure-test it before shipping
The last field is a signature: the person accountable for the feature's economics, usually you, with a name and a date, because an unsigned budget is a spreadsheet nobody answers for. Next to the signature go the two review triggers: the provider reprices, or the workload moves (an alarm fires or the traffic mix shifts). Either one reopens the page, and amendments go in writing so the page stays the record.
Before the signature goes on, pressure-test the page against three futures on paper.
- Adoption triples. Does the margin at the median hold at three times the volume, does the batch line absorb the deferrable work, and which alarm fires first?
- The provider cuts prices by about half. Cheaper tokens redraw the routing arithmetic and maybe the pricing stance. Does the page say who reruns the numbers, and whether the saving goes to margin or to price?
- One customer becomes a heavy user overnight. Does the allowance cap the loss, and does the per-task alarm page someone before the invoice does?
If the page answers all three without a meeting, it is ready to sign; wherever it cannot, the field is not really filled.
You arrived at this part with a feature that works and a bill addressed to the whole roadmap. You leave with a feature that can defend its own existence: a unit with a cost target, a designed wait, a model earning its rate on every lane, nothing paid for twice, nothing realtime that did not need to be, a price that tracks the cost, and alarms with names on them, on one page over a signature. Bring that page to the next quarterly review; the rest of The Frontier applies the same discipline to what your product knows, how its agents are put to work, and what high-stakes territory demands.
Try it now
The drill is the capstone, so block an afternoon and bring one real feature.
Get the budget. Download the fillable Inference Budget and gather your notes from this part's earlier drills: the per-task receipt, the latency targets, the escalation rate, the caches, the queue split, and the pricing arithmetic; most fields are already drafted there.
Fill it end to end. Work in the order this chapter walked: unit and target, latency budget, routing plan, caching plan, realtime split, pricing stance, then the alarms. Where a field has no answer, make the decision now rather than typing a placeholder. Scale it down: where the instrumentation does not exist yet, enter an order-of-magnitude estimate, mark it as an estimate, and make wiring the real number the first request you hand engineering.
Run the three pressure scenarios. On paper, triple the adoption, cut the provider's prices by about half, and hand one customer a heavy month. Note every field that had no answer or pointed at a person who would not actually be paged.
Amend what failed and sign it. Fix the fields the scenarios exposed, put a name and the review triggers at the bottom, and bring the page to the next roadmap review.
Chapter Summary
- The Inference Budget is one signed page per feature that records this part's decisions: the unit and its cost target, the latency budget, the routing plan, the caching plan, the realtime split, and the pricing stance.
- The unit section names the task, its per-task cost today from real logs, and the target that cost must reach by a date.
- The latency budget writes down the first-token and completion targets that every later cost move must live inside.
- The routing plan records which traffic gets which model, the escalation gate, and the eval bar that admitted each model to its lane.
- The caching plan lists the caches the workload earned, each with a lifetime and an expected hit rate.
- The realtime split sorts every job into realtime, background, or batch, and the batch line takes the discount.
- The pricing stance records the currency, the included allowance, and the margin at the median user and the heavy user.
- Per-task cost, escalation rate, and cache hit rate each get a threshold that pages a person by name, wired like production drift alarms.
- The budget is signed by the person accountable for the feature's economics, reopened when the provider reprices or the workload moves, and pressure-tested against tripled adoption, halved prices, and an overnight heavy user before it ships.
- This closes The Unit Economics of AI; the rest of The Frontier is where the same signed-page discipline meets knowledge, agent fleets, and high-stakes territory.
Sources
- Chen, L., Zaharia, M., & Zou, J. (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv.
- Huyen, C. (2025). AI Engineering: Building Applications with Foundation Models. O'Reilly Media.
- Anthropic, OpenAI, and Google pricing, batch, and prompt caching documentation (last verified July 2026).
- The dated Price Sheet for current per-token rates and tier ratios.