The bill of materials: cost the task, not the call · The Builder's Stack


The cost sprint delivers. Your team trims the system prompt, moves the first pass to a smaller model, and cuts the cost per API call by a third, a win that headlines the next review. Then the invoice for the following month lands higher than the one before it. The rewrite that made each call cheaper also restructured the flow: what used to finish in two calls now takes a planning call, a drafting call, and a judge pass to check the draft, and when the judge fails the draft, the loop runs again. Finishing one user task now takes five calls and, on a bad run, two retries. Every call got cheaper, every task got more expensive, and nobody was counting tasks.

Count successful tasks, not API calls

In Why good AI features die of bad unit economics, features died because the cost of serving each user climbed while the revenue from each user stayed flat. Repairing that starts with measuring the right unit, and the right unit is cost per successful task: everything you spend, divided by the number of tasks users actually completed. A task is the thing the user came to do: an email drafted and sent, a support question resolved, a meeting turned into a summary the user accepted. A call is a piece of plumbing the task happens to use, and the call count changes every time engineering restructures the flow, which is exactly what happened in the opening scene.

The unit of cost is the successful task, not the API call: failed attempts, retries, and judge passes all belong on the same receipt.

The word successful carries real weight in that definition. A task the user abandoned halfway, an answer the user rejected, a draft that failed the judge three times before the user gave up: each of these spent tokens and produced nothing anyone paid for. All of that spend goes into the numerator, while only completed tasks go into the count you divide by. The arithmetic is unforgiving: if one task in five fails, the four that finish each carry a share of the failed task's spend on top of their own, which means a team can leave its prompts untouched and still cut unit cost just by raising the completion rate.
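The one-in-five arithmetic can be worked through in a few lines. This is a minimal sketch, not a prescribed implementation: the function name is hypothetical, and it assumes every attempt costs the same one unit whether or not it finishes, which real tasks will not.

```python
def cost_per_successful_task(total_spend: float, completed_tasks: int) -> float:
    """Everything spent, divided by the tasks users actually completed."""
    if completed_tasks <= 0:
        raise ValueError("no completed tasks: cost per successful task is undefined")
    return total_spend / completed_tasks


# Assumption for illustration: each attempt costs 1.0 unit, finished or not.
COST_PER_ATTEMPT = 1.0
attempts = 5
completed = 4  # one task in five fails

unit_cost = cost_per_successful_task(attempts * COST_PER_ATTEMPT, completed)
print(unit_cost)  # 1.25: each finished task carries a quarter of the failed one

# Same prompts, same spend per attempt, but every task now finishes.
all_finish = cost_per_successful_task(attempts * COST_PER_ATTEMPT, 5)
print(all_finish)  # 1.0
```

Under these assumptions, raising the completion rate from four in five to five in five cuts unit cost by a fifth without touching a prompt.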

The bill of materials for one task

Costing a task means itemizing everything spent between the moment the user starts it and the moment they get what they came for. The receipt has more lines than most teams expect.

Input tokens. Everything the model reads: the system prompt, the instructions, the retrieved facts, and the conversation history, billed again on every call. A task that takes five calls pays to send much of the same context five times.
Output tokens. Everything the model produces, including intermediate drafts and tool arguments the user never sees. Providers typically price output tokens at a multiple of the input rate, so verbose intermediate steps cost more per token than anything the model read.
Tool calls. Every lookup the product makes mid-task adds a round trip, and the results come back into the window as input tokens on the next call; external search and data APIs can add per-request fees of their own.
Retries and failed attempts. A malformed response, a timeout, a draft the judge fails: each one bills in full and delivers nothing, which makes retries pure cost without revenue.
Evaluation overhead. Judge passes on live traffic and the sample of outputs you grade for quality both run on the meter, and they belong on the receipt of the tasks they check. We set up that machinery in Graders: deterministic, judges, and humans.
The heavy-user tail. The average receipt hides the users who attach long documents and carry long histories; their tasks can cost a large multiple of the median task, and on many products a small slice of users accounts for most of the spend.

Miss a line and the per-task figure reads lower than reality, and the gap between your figure and the truth becomes the recurring bad news in your invoice.
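The lines above can be sketched as a receipt for a single task. Everything named here is an assumption made for illustration: the `Call` record, the `kind` labels, and especially the rates, which are arbitrary units with an assumed 4:1 output-to-input ratio rather than any provider's prices. The heavy-user tail is left out because it only shows up when you compare receipts across many tasks.

```python
from dataclasses import dataclass

# Hypothetical rates in arbitrary units per 1,000 tokens (not real prices).
INPUT_RATE = 1.0
OUTPUT_RATE = 4.0  # assumed output/input ratio, for illustration only


@dataclass
class Call:
    kind: str              # "first_attempt", "retry", or "judge"
    input_tokens: int
    output_tokens: int
    tool_fee: float = 0.0  # per-request fee from an external API, if any


def call_cost(call: Call) -> float:
    return (call.input_tokens / 1000 * INPUT_RATE
            + call.output_tokens / 1000 * OUTPUT_RATE
            + call.tool_fee)


def receipt(calls: list[Call]) -> dict[str, float]:
    """Itemize one task's spend; retries and judge passes get their own lines."""
    lines = {"input_tokens": 0.0, "output_tokens": 0.0, "tool_fees": 0.0,
             "retries": 0.0, "evaluation": 0.0}
    for call in calls:
        if call.kind == "retry":
            lines["retries"] += call_cost(call)
        elif call.kind == "judge":
            lines["evaluation"] += call_cost(call)
        else:
            lines["input_tokens"] += call.input_tokens / 1000 * INPUT_RATE
            lines["output_tokens"] += call.output_tokens / 1000 * OUTPUT_RATE
            lines["tool_fees"] += call.tool_fee
    lines["total"] = sum(lines.values())
    return lines


# One task: plan, draft (with a paid lookup), judge fails it, redraft, judge passes.
task_calls = [
    Call("first_attempt", 2000, 500),
    Call("first_attempt", 4000, 1000, tool_fee=0.5),
    Call("judge", 5000, 250),
    Call("retry", 4000, 1000),
    Call("judge", 5000, 250),
]
task_receipt = receipt(task_calls)
for line, amount in task_receipt.items():
    print(f"{line:>14}: {amount:.2f}")
```

With these made-up numbers, the retry and the two judge passes account for 20 of the 32.5 units, which is the kind of result a call-level dashboard never surfaces.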

The instrumentation to ask for by name

You do not need to build any of this yourself, but you do need to ask engineering for it by name, because the default dashboards will not volunteer it.

Per-task token logs. Every model call tagged with the task it served, carrying its input and output token counts, so a task's spend can be added up instead of guessed at.
A task-completion marker. An event that fires when the user gets what they came for: the email sent, the ticket closed, the summary accepted. Without it there is no denominator, and cost per task cannot be computed at all. The same marker feeds the live quality tracking from Production signals: evals after the ship.
Retry counts. How many attempts each task took, and what the failed attempts died of, since every retry is a full-price call that produced nothing.

If the dashboard your team watches shows calls and not tasks, the number lies in your favor until the invoice corrects it, because any restructuring that adds calls makes each call look cheaper while the task quietly gets more expensive.
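A small sketch shows how the two dashboards diverge once calls are tagged with the task they served. The log format, task IDs, and costs are hypothetical, chosen to mirror the opening scene: the flow goes from two calls per task to five, while each call gets a third cheaper.

```python
from collections import defaultdict

# Hypothetical per-call log rows: (task_id, cost_of_call), in arbitrary units.
before = [("t1", 3.0), ("t1", 3.0), ("t2", 3.0), ("t2", 3.0)]
after = [(task_id, 2.0) for task_id in ("t1", "t2") for _ in range(5)]

# Task IDs for which the completion marker fired.
completed = {"t1", "t2"}


def summarize(call_log, completed_task_ids):
    """Compute the per-call figure and the per-successful-task figure."""
    if not call_log or not completed_task_ids:
        raise ValueError("need at least one call and one completed task")
    spend_by_task = defaultdict(float)
    for task_id, cost in call_log:
        spend_by_task[task_id] += cost
    # Spend on tasks that never completed still counts in the total.
    total = sum(spend_by_task.values())
    return {
        "cost_per_call": total / len(call_log),
        "cost_per_successful_task": total / len(completed_task_ids),
    }


before_summary = summarize(before, completed)
after_summary = summarize(after, completed)
print(before_summary)  # cost_per_call 3.0, cost_per_successful_task 6.0
print(after_summary)   # cost_per_call 2.0, cost_per_successful_task 10.0
```

The per-call number improves from 3.0 to 2.0 while the per-task number climbs from 6.0 to 10.0, and only the second one matches the invoice.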

Where the money hides: stuffed context and retries

When a team writes its first honest receipt, it usually expects output tokens to dominate, because the output is the part everyone looks at. The receipt almost always says otherwise. On multi-call features, input tokens dominate: the history that gets re-sent on every call, the retrieved documents pasted in whole when a paragraph would have served, the system prompt that has accumulated additions until it dwarfs the question it frames. Retries compound this, because a failed attempt re-bills the entire stuffed window, not just the answer that came back wrong.

On multi-call features the money hides in stuffed context and retries, the tokens you re-send on every call, not in the answer at the end.
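To see why input tends to dominate, it helps to count tokens for a flow where each call re-sends the system prompt plus everything accumulated so far. The following is a rough sketch with invented sizes; the function and parameter names are hypothetical, and real flows may trim or cache context in ways this ignores.

```python
def tokens_for_task(system_prompt: int, added_per_call: int,
                    output_per_call: int, calls: int) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) when every call re-sends the window."""
    input_total = 0
    history = 0
    for _ in range(calls):
        # Each call pays for the system prompt plus all history so far.
        input_total += system_prompt + history
        # Prior outputs and tool results join the window for the next call.
        history += added_per_call
    return input_total, output_per_call * calls


# Invented sizes: a 2,000-token system prompt, 1,500 tokens added to the
# window after each call, 500 tokens of output per call, five calls per task.
inp, out = tokens_for_task(system_prompt=2000, added_per_call=1500,
                           output_per_call=500, calls=5)
print(inp, out)  # 25000 2500
```

In this example the task reads ten times as many tokens as it writes, so even if output were priced at four times the input rate (an assumed ratio, not a quoted one), input would still be the larger line. A retry of the fifth call would re-bill its full 8,000-token window, not just the 500 tokens that came back wrong.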

This is why the highest-leverage cost work often looks like context work. Deciding what actually belongs in the window on each call, and how much of it, is the discipline we teach in Context budgets: fit the right facts into a finite window, and the receipt is what tells you whether that discipline is paying. The per-token rates themselves move too often to print here; our dated Price Sheet keeps the current numbers, and the stable pattern is that flagship rates run at a large multiple of small-model rates, with output priced above input.

Try it now

This drill produces your first real receipt, and you will carry it into Write your Inference Budget and ship a feature that pays for itself at the end of this part.

Get your task. Pick one real workflow with an unambiguous finish line: a task in your own product if you can read its logs, or a task you personally run with an assistant, like turning meeting notes into a document you actually send.

Log every call it takes. Run the task end to end and record each model call it makes, with input and output token counts from your provider's usage dashboard, and mark which calls were retries or checks rather than first attempts. Scale it down: one task, hand-counted straight from the usage dashboard, is enough to learn the method.

Write the receipt. On one page, total the tokens in, the tokens out, the tool calls, the retries, and any judge passes, then compute the per-task figure at your provider's current rates, which live on their pricing page and in the Price Sheet.

Read it for the hiding spots. Circle the largest line. If it is input tokens, write one sentence on what filled the window and whether the task needed it; if it is retries, write one sentence on what the failed attempts died of. That sentence is the first cut you will make.

Chapter Summary

Measure cost per successful task: everything you spend, divided by the tasks users actually completed.
A task is what the user came to do; calls are plumbing, and the call count changes every time the flow is restructured.
Failed attempts, retries, and judge passes spend money and finish nothing, so they belong on the same receipt as the tasks that succeed.
The full bill of materials covers input tokens, output tokens, tool calls, retries, evaluation overhead, and the heavy-user tail.
Ask engineering by name for per-task token logs, a task-completion marker, and retry counts.
A dashboard that shows calls instead of tasks flatters every optimization until the invoice corrects it.
The money usually hides in input tokens and retries, the context re-sent on every call, not in the answer at the end.
Volatile rates live in the dated Price Sheet; chapter arithmetic stays in ratios that hold.
Cost is one axis of the budget, and the other is how long the user is willing to wait, which is where Latency: design how fast it feels picks up.

Sources

Huyen, C. (2025). AI Engineering: Building Applications with Foundation Models. O'Reilly Media.
OpenAI and Anthropic usage dashboard and pricing documentation (last verified July 2026).
The dated Price Sheet for current per-token rates and model-tier ratios.
