Your meeting notetaker ships a summary feature to a launch channel full of praise, because the summaries are genuinely good and nobody has to take notes anymore. Adoption triples within a month as whole teams turn it on for every recurring meeting. Then the first full month's invoice lands at a number several times anything the plan anticipated, and finance asks a question with two halves: what does one summary cost us, and what is one summary worth? Nobody in the room can answer either half. The feature's quality was reviewed, its latency was tested, its rollout was staged, and its economics were never designed at all. This part is about answering finance's question before it gets asked.
The cost cliff between demo and production
A demo is nearly free because it runs once. One handpicked transcript, one model call, one polished summary on the screen, and the total spend for the meeting that green-lights the feature would not cover the room's coffee. That single run tells you everything about quality and almost nothing about cost, because production cost is the call multiplied by everything the demo left out.
Production multiplies in ways the demo never showed:
- Every user, every task. The one run becomes every meeting on every calendar that turned the feature on, and the meter starts the moment the rollout does.
- Retries. When an output fails validation or a call times out, the product pays for the failed attempt and again for the retry, while the user sees one answer.
- Accumulating context. Conversations lengthen, transcripts run long, and retrieval attaches documents to every call, so input cost climbs across a session and the tenth turn can cost a multiple of the first.
- Success itself. The better the feature is, the more often it runs, so the bill climbs fastest exactly when the product is working as intended.
Serving one more user of conventional software costs close to nothing, which is where software's famous margins come from, while an AI feature pays for compute on every single answer.
An AI feature has the cost structure of manufacturing rather than software, because every answer consumes metered materials, so the margin lives or dies at the level of one unit.
Features die of this quietly. Across the industry the pattern is rarely a public shutdown; it is a free tier that shrinks, a daily cap that appears, a capability that moves behind the most expensive plan, or a second version that never arrives. In most of those cases the quality held up fine, and the invoice made the decision.
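The multipliers in this section can be sketched as arithmetic. The block below is a minimal illustration, not a costing method: every number is a hypothetical placeholder, the function names are invented for this example, and it assumes failed calls are retried until one succeeds and that each turn resends the whole conversation history.

```python
# Every number below is a hypothetical placeholder, not a measured value or a real rate.


def expected_attempts(failure_rate: float) -> float:
    """Expected calls per delivered answer when failed calls are retried until one succeeds.

    Assumes each attempt fails independently with the same probability.
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    return 1.0 / (1.0 - failure_rate)


def turn_input_tokens(turn: int, base_tokens: int, tokens_added_per_turn: int) -> int:
    """Input tokens sent on a 1-indexed turn when the whole history is resent each time."""
    return base_tokens + (turn - 1) * tokens_added_per_turn


def monthly_calls(users: int, tasks_per_user: int, failure_rate: float) -> float:
    """Paid calls per month: every user, every task, plus the retries the user never sees."""
    return users * tasks_per_user * expected_attempts(failure_rate)


if __name__ == "__main__":
    # The demo paid for one call; production pays for all of these.
    production_calls = monthly_calls(users=5_000, tasks_per_user=40, failure_rate=0.2)
    print(f"demo: 1 call, production: {production_calls:,.0f} calls a month")

    first = turn_input_tokens(1, base_tokens=2_000, tokens_added_per_turn=1_500)
    tenth = turn_input_tokens(10, base_tokens=2_000, tokens_added_per_turn=1_500)
    print(f"tenth turn sends {tenth / first:.2f}x the input tokens of the first")
```

None of these inputs is visible in a single demo run, which is why the demo says so little about the invoice.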
Why bills keep rising while token prices fall
The obvious objection is that token prices are collapsing, and they are. The price of a given level of capability has fallen hard year over year, in some stretches by an order of magnitude within a single year, and industry analyses expect the slide to continue. If your product did exactly the same work every month, its bill would melt on its own.
Bills rise anyway, because no product does the same work every month. Consumption multiplies faster than prices fall:
- Agents turn one call into many. A task that was a single request in the chat era becomes a dozen calls when an agent breaks the work into steps, reads files, uses tools, and checks its own output.
- Contexts keep lengthening. Larger windows invite larger inputs, and reasoning models produce intermediate tokens that you pay for and the user never sees.
- Usage deepens. The users who adopt an AI feature lean on it harder over time, and product teams keep finding new places to call the model because each call now looks cheap.
Economists met this pattern long before tokens existed. The Jevons paradox, first observed when more efficient steam engines increased Britain's total coal consumption rather than cutting it, says that when a resource gets cheaper to use, total spending on it can rise, because the lower price makes more uses worth pursuing. Cheaper tokens behave the same way: every price drop moves another marginal use into the affordable column, and teams spend the savings, and then some.
So falling prices will not rescue a bill nobody designed; the teams whose bills actually fall hold their consumption steady and bank the price drops, which takes the decisions this part teaches.
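The race between falling prices and rising consumption is a single multiplication. The sketch below uses illustrative multipliers only (a tenfold price drop, a dozen agent calls per task, doubled context, and usage up by half); none of them is a measurement, and `bill_multiplier` is a name invented for this example.

```python
import math
from typing import Iterable


def bill_multiplier(price_multiplier: float, consumption_multipliers: Iterable[float]) -> float:
    """How the bill changes when the unit price and the tokens consumed both change.

    bill = price x consumption, so the multipliers simply compound.
    """
    return price_multiplier * math.prod(consumption_multipliers)


if __name__ == "__main__":
    # Hypothetical year: price per token falls to a tenth of what it was.
    price = 0.1
    # Hypothetical growth: 12 agent calls per task, 2x context, 1.5x usage.
    growth = [12, 2, 1.5]

    print(f"consumption grew: bill x{bill_multiplier(price, growth):.1f}")
    # With consumption held steady, the team banks the whole price drop.
    print(f"consumption held: bill x{bill_multiplier(price, []):.1f}")
```

With these placeholder numbers the bill comes out about 3.6 times higher despite the tenfold price cut, and a team that held consumption steady would pay a tenth of its earlier bill.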
Margin is decided in the same reviews that decide quality
An AI feature rarely dies because it is bad; it dies because nobody designed its margin, and margin is as designable as quality.
Nobody ships hardware without costing it. A phone team can tell you what the battery, the casing, and the camera module each contribute to the cost of one unit, and every design review negotiates those contributions against what the product can charge. An AI feature is manufactured too, out of model calls, retrieved documents, retries, and tool use, so it deserves the same bill of materials: a per-unit cost you can set beside per-unit value and defend in a review.
The reviews already exist. You already decide what good means and hold every change to it (the discipline from The quality bar: decide what good means), and the same meetings should carry the cost question, because most changes move both numbers at once. A prompt revision that improves tone and doubles output length is a margin decision, and so is a model upgrade that lifts accuracy at a multiple of the rate. When cost per task sits on the same page as the quality score, those trades get made with open eyes instead of discovered after the money is gone.
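One way to picture cost per task on the same page as the quality score is the sketch below. It is illustrative only: the `Variant` record, the scores, the costs, and the value per task are all hypothetical, and a real review would take them from measurement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One candidate version of a feature, as it might appear in a design review."""
    name: str
    quality_score: float  # hypothetical 0-to-1 score from the quality review
    cost_per_task: float  # dollars per completed task, hypothetical


def unit_margin(value_per_task: float, cost_per_task: float) -> float:
    """Fraction of one task's value that is left after paying for it."""
    return (value_per_task - cost_per_task) / value_per_task


if __name__ == "__main__":
    value_per_task = 0.10  # hypothetical: what one task is worth to the business
    current = Variant("current prompt", quality_score=0.82, cost_per_task=0.04)
    revised = Variant("warmer tone, longer output", quality_score=0.86, cost_per_task=0.07)

    for v in (current, revised):
        margin = unit_margin(value_per_task, v.cost_per_task)
        print(f"{v.name}: quality {v.quality_score:.2f}, margin {margin:.0%}")
```

Seen side by side, the revision buys a small quality gain for half the unit margin, which is the kind of trade the review should make on purpose.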
What this part covers
We priced the fleets that build products in Economics: what a fleet costs and when it pays, where the meter runs on your own agents. This part prices the product itself, where the meter runs on every user, and each chapter hands you one lever:
- What one completed task costs, counted across every call it triggers: The bill of materials: cost the task, not the call.
- What the wait feels like and which speed is worth paying for: Latency: design how fast it feels.
- Which model runs each task: Routing and cascades: the cheapest model that passes your bar.
- What you never pay for twice: Caching: never pay for the same thinking twice.
- Which work can wait for the cheap tier: Batch and background: choose the realtime line.
- What the user pays, and in what unit: Pricing: charge in a currency your costs track.
The part closes when you write your Inference Budget and ship a feature that pays for itself, signing those decisions into a one-page budget (the fillable Inference Budget). One convention holds throughout: current rates and discounts move too fast for prose, so the chapters speak in ratios (flagship rates run at a large multiple of small-model rates, batch tiers discount on the order of half) and the dated Price Sheet keeps the live numbers.
Try it now
This drill spends no tokens; it is pencil work against a pricing page.
Pick your unit. Choose one AI feature you use daily, such as a meeting summarizer, an email drafter, or a coding assistant, and define its task the way a user would count it: one summary, one drafted reply, one resolved question.
Get the rates. Open the pricing page of the provider you believe runs it, or the one you would run it on, and note the input and output rates for a plausible model.
Estimate one task. Guess the tokens in (instructions, context, the document or transcript itself) and the tokens out, each to the nearest power of ten, then multiply each by its rate and add the two into a cost for a single task; order of magnitude is the point.
Run the three volumes. Multiply your task cost by a hundred, ten thousand, and a million tasks a month, and write the three monthly bills side by side.
Mark your caring line. Note the volume at which you would start to care what the answer costs, and write one sentence on what you would cut first: the model, the context, the retries, or how often the feature runs.
Keep the page, because The bill of materials: cost the task, not the call replaces your guesses with a method, and the numbers you wrote today become the before picture.
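If you would rather check your pencil work with a script, the drill's arithmetic can be sketched as follows. The token counts and per-million-token rates are placeholders chosen for round numbers, not any provider's actual prices, and the function names are invented for this example; substitute the figures from the pricing page you opened.

```python
def task_cost(tokens_in: int, tokens_out: int,
              rate_in_per_million: float, rate_out_per_million: float) -> float:
    """Dollar cost of one task: input and output tokens, each priced at its own rate."""
    cost_in = tokens_in / 1_000_000 * rate_in_per_million
    cost_out = tokens_out / 1_000_000 * rate_out_per_million
    return cost_in + cost_out


def monthly_bills(cost_per_task: float,
                  volumes: tuple = (100, 10_000, 1_000_000)) -> dict:
    """Monthly bill at each of the drill's three volumes."""
    return {volume: cost_per_task * volume for volume in volumes}


if __name__ == "__main__":
    # Placeholder guesses, to the nearest power of ten.
    one_task = task_cost(
        tokens_in=10_000,
        tokens_out=1_000,
        rate_in_per_million=3.0,    # hypothetical dollars per million input tokens
        rate_out_per_million=15.0,  # hypothetical dollars per million output tokens
    )
    print(f"one task: ${one_task:.3f}")
    for volume, bill in monthly_bills(one_task).items():
        print(f"{volume:>9,} tasks a month: ${bill:,.2f}")
```

The three bills differ only by the volume multiplier, so the caring line depends on how many tasks run, not on what one task costs.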
Chapter Summary
- A demo prices one run on a handpicked input; production multiplies that run by every user, every retry, and every token of accumulated context.
- An AI feature pays for compute on every answer, so its cost structure is closer to manufacturing than to conventional software.
- The bill climbs fastest when the feature succeeds, because adoption is a multiplier on the meter.
- Falling token prices do not lower bills on their own; agents, longer contexts, and deeper usage expand consumption faster than prices fall.
- Cheaper tokens invite more token spending, the same paradox economists first documented with coal.
- Margin is as designable as quality, and it belongs in the same reviews, costed with a bill of materials like any manufactured unit.
- Never let the invoice be the alarm: estimate cost per task before launch and set an alert at the level where the estimate proves wrong.
- The estimating starts in earnest in The bill of materials: cost the task, not the call, which turns today's guesses into a method.
Sources
- Jevons, W. S. (1865). The Coal Question. Macmillan.
- Appenzeller, G. (2024). Welcome to LLMflation: LLM inference cost is going down fast. Andreessen Horowitz.
- Stanford HAI (2025). The AI Index Report 2025 (inference cost and adoption trends). Stanford University.
- Anthropic and OpenAI pricing documentation (last verified July 2026).
- The Price Sheet, this part's dated pricing companion (/artifacts/price-sheet.md).