Latency: design how fast it feels · The Builder's Stack

You put two support products side by side on the same billing question, and behind both of them the model takes about nine seconds to produce the complete answer. The first product holds a spinner over a blank panel and then prints the reply in one block. The second prints its opening sentence within a second and keeps writing while you read along. When you check the reviews afterward, the second product gets called fast and the first gets called slow, and a few of the slow complaints come from people who gave up mid-wait. The meter behind both products recorded the same nine seconds, so nothing about the models explains the gap. What differed was how those seconds felt, and how a wait feels is something you design rather than something you buy.

The two numbers: time to first token and time to complete

Behind every response, two numbers describe the wait. Time to first token is how long the user stares at nothing before the first visible output appears (a token is the small chunk of text a model produces at a time, so this is effectively the time to the first word). Time to complete is how long until the answer is finished and the user can act on all of it. Engineering dashboards mostly track the second number, because completion time is what the infrastructure logs, while users mostly judge the first, because the moment before anything appears is the only moment the product looks broken.

Response-time research from decades before AI products puts firm markers on that judgment, and the markers have held across every kind of interface since:

Around a tenth of a second, a response reads as instantaneous.
Up to about one second, the user notices the delay but stays in flow.
Past about ten seconds, attention leaves for another tab, and a blank screen invites the user to leave with it.

A nine-second AI answer sits just short of the ten-second limit on the completion clock and can still land inside the one-second mark on the first-token clock, which is exactly where the streaming product in the opening scene lived.
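The thresholds above can be written as a small lookup, which makes it easy to score the two clocks separately. This is a minimal sketch: the function name `classify_wait` is hypothetical, the cutoffs are the approximate values from the list, and the label for the band between one and ten seconds is an assumption, since the list does not name it.

```python
def classify_wait(seconds: float) -> str:
    """Map a wait in seconds onto the approximate response-time thresholds."""
    if seconds <= 0.1:
        return "instant"
    if seconds <= 1.0:
        return "in flow"
    if seconds <= 10.0:
        return "waiting"  # assumed label for the band between 1 s and 10 s
    return "attention lost"


# The streaming product from the opening scene, scored on both clocks.
first_token_s, complete_s = 1.0, 9.0
print(classify_wait(first_token_s), "/", classify_wait(complete_s))
```

Scoring the same response twice, once per clock, is the point of the sketch: one answer can sit in two different bands at once.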

Users pay attention to the time before the first word, so a product that starts answering in one second feels faster than one that finishes sooner behind a spinner.
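Both clocks can be read off any streamed response by noting when the first non-empty chunk arrives and when the stream ends. The sketch below assumes the response is available as a plain iterable of text chunks. `measure_stream` and `fake_model_stream` are hypothetical names, and the fake stream stands in for a real provider client, whose streaming interface will differ.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[float, float, str]:
    """Consume a stream and return
    (time_to_first_token, time_to_complete, full_text), times in seconds."""
    start = clock()
    first_token_at = None
    parts = []
    for chunk in chunks:
        if first_token_at is None and chunk:
            first_token_at = clock()  # first visible output
        parts.append(chunk)
    done_at = clock()
    # If nothing visible ever arrived, the blank wait lasted the whole time.
    if first_token_at is None:
        first_token_at = done_at
    return first_token_at - start, done_at - start, "".join(parts)


def fake_model_stream() -> Iterator[str]:
    """Hypothetical stand-in for a model: a short pause, then words."""
    time.sleep(0.05)
    for word in ["Your ", "invoice ", "was ", "charged ", "twice."]:
        yield word
        time.sleep(0.01)


if __name__ == "__main__":
    ttft, total, text = measure_stream(fake_model_stream())
    print(f"first token: {ttft:.2f}s, complete: {total:.2f}s")
```

The `clock` parameter is there so the function can be tested with a scripted clock instead of real sleeps. In production you would log both numbers per request, not only the completion time.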

Perceived speed is a design decision

The nine seconds themselves are set by the model, the prompt, and the length of the answer, and the rest of this part works on driving them down. How those seconds feel is set by the waiting state, and you can redesign a waiting state this week without touching the model. These are the moves that carry most of the improvement:

Stream the answer. Render output as the model produces it instead of holding it until completion. Most people read more slowly than a model writes, so once the first sentence is on screen the reader stays occupied until the end, and the felt wait ends at the first word.
Return results progressively. Deliver the outline first and the detail second. A research tool can list the sources it found before the synthesis arrives, and a coding assistant can show its plan before the diff, so the user starts judging the answer while the rest is still being produced.
Acknowledge instantly and honestly. The moment a request lands, confirm it in concrete terms ("checking the last three invoices on this account"). This is not a trick, because the work really is in flight; it converts a blank first second into evidence that the product is working on the right task.
Show real progress. Name the actual step underway ("reading the transcript", "comparing 14 documents") rather than a generic bar. Research on operational transparency, sometimes called the labor illusion, found that people rate a service as more valuable when they can watch the work happen, even when the wait itself is identical.

These moves work because of how attention behaves, which we covered in Perception: make the warning impossible to miss. Attention goes to change that carries meaning. A spinner changes constantly and means nothing, so it stops informing after its first second on screen. Streamed words and real progress change and inform at the same time, so the same seconds feel occupied instead of empty. This is also why a faster model behind a spinner loses comparisons to a slower model that streams: the spinner's users experience the full completion time, and the streaming product's users experience the time to first token.
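One way to picture the four moves together is as a sequence of events the interface renders the moment each one arrives: an acknowledgment, named progress steps, then streamed tokens. The sketch below is illustrative only. `UiEvent`, `waiting_state`, and `render` are hypothetical names, and a real product would get its steps and tokens from the backend, not from fixed lists.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class UiEvent:
    kind: str  # "ack", "progress", "token", or "done"
    text: str


def waiting_state(
    task: str, steps: Iterable[str], tokens: Iterable[str]
) -> Iterator[UiEvent]:
    """Yield events in the order the user should see them."""
    yield UiEvent("ack", task)  # instant, honest acknowledgment
    for step in steps:
        yield UiEvent("progress", step)  # the actual step underway
    for token in tokens:
        yield UiEvent("token", token)  # streamed answer text
    yield UiEvent("done", "")


def render(events: Iterable[UiEvent], write: Callable[[str], object]) -> None:
    """Write each event as soon as it arrives; nothing is held back."""
    for event in events:
        if event.kind == "token":
            write(event.text)
        elif event.kind in ("ack", "progress"):
            write(f"[{event.text}]\n")
        elif event.kind == "done":
            write("\n")


if __name__ == "__main__":
    events = waiting_state(
        "checking the last three invoices on this account",
        ["reading the transcript"],
        ["Your ", "refund ", "is ", "on ", "its ", "way."],
    )
    render(events, lambda s: print(s, end="", flush=True))
```

Because `render` writes per event, every change on screen carries meaning, which is the property the spinner lacks.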

Set a latency budget for each feature class

A latency budget is two targets written down before the feature is built: a first-token target and a completion target. One pair of targets cannot cover a whole product, because different features put the user in different waiting postures, so set the pair per feature class:

A chat reply the user is watching needs a first token within about a second and a completion pace the user can read along with.
A requested artifact, like a report or a long summary, needs an instant acknowledgment, a first partial result within a few seconds, and completion within a minute or two behind visible progress.
A bulk or scheduled job needs instant confirmation that it was queued, completion by its deadline, and a notification when it finishes. It needs nothing in between.
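A budget like this can live in code as a small table that measured runs are checked against. The sketch below is one possible shape. The class names and every number in `BUDGETS` are placeholder assumptions chosen to match the rough magnitudes above, and `LatencyBudget` and `check_budget` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LatencyBudget:
    first_output_s: Optional[float]  # None: nobody is watching for first output
    completion_s: float


# Placeholder targets; replace with numbers that fit how your users wait.
BUDGETS: Dict[str, LatencyBudget] = {
    "chat_reply": LatencyBudget(first_output_s=1.0, completion_s=15.0),
    "requested_artifact": LatencyBudget(first_output_s=3.0, completion_s=120.0),
    "bulk_job": LatencyBudget(first_output_s=None, completion_s=8 * 3600.0),
}


def check_budget(
    feature_class: str, first_output_s: float, completion_s: float
) -> List[str]:
    """Return a list of budget misses for one measured run (empty if none)."""
    budget = BUDGETS[feature_class]
    misses = []
    if budget.first_output_s is not None and first_output_s > budget.first_output_s:
        misses.append(
            f"first output {first_output_s:.1f}s exceeds "
            f"{budget.first_output_s:.1f}s target"
        )
    if completion_s > budget.completion_s:
        misses.append(
            f"completion {completion_s:.1f}s exceeds "
            f"{budget.completion_s:.1f}s target"
        )
    return misses


if __name__ == "__main__":
    # A nine-second answer behind a spinner versus the same answer streamed.
    print(check_budget("chat_reply", 9.0, 9.0))
    print(check_budget("chat_reply", 0.8, 9.0))
```

Keeping the first-output target optional lets the bulk class skip the first-token clock entirely while still holding a completion deadline.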

Write the latency budget next to the cost target you built in The bill of materials: cost the task, not the call, because the two trade through the same lever. The model that produces the best answer usually runs at a large multiple of small-model rates and often takes longer to begin responding. A small model starts sooner and costs a small fraction as much. Provider options that buy faster completion carry premiums of their own. Every latency target is therefore also a spending decision, which is why the Inference Budget this part builds toward, in Write your Inference Budget and ship a feature that pays for itself, keeps a latency line beside every cost line.

Let unwatched work be slow and cheap

The budget also tells you where speed buys nothing. Overnight re-summarization of a help center, a weekly digest, re-tagging an archive of old tickets: nobody stares at a screen while these run, so the first-token clock does not apply to them and the completion clock only has to beat a deadline measured in hours. Paying real-time rates for that work buys a speed no one experiences, at the most expensive tier providers sell. The line between work that has to feel fast and work that only has to be done by morning deserves a deliberate decision per feature, and Batch and background: choose the realtime line is where we draw it.

Try it now

This drill costs a few ordinary prompts and about 15 minutes, and it produces the first two lines of your latency budget.

Pick one task and two products. Choose a task you can run identically in two AI products you already use: summarize the same long email thread in two writing assistants, or ask two support bots the same policy question.

Time both with a stopwatch. Run the task three times in each product and record two numbers per run: time until the first visible output and time until the answer is complete. Take the middle value of each three. A phone stopwatch is plenty, because you are measuring seconds rather than milliseconds.

Redesign the slower wait on paper. For the product with the worse first number, sketch its waiting state at three moments: what the screen should show at one second, what it should show at three seconds, and what it should show when the answer is done. Use the moves from this chapter: an honest acknowledgment, progress described in real steps, partial results before the whole.

Write the budget you would set. Finish with a first-token target and a completion target for that feature, each with one sentence on why the number fits the way its users actually wait.
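If you would rather not pick the middle values by eye, the timing step of the drill reduces to a median over three runs per product. The stopwatch readings below are invented for illustration, and `middle_values` is a hypothetical helper name.

```python
from statistics import median
from typing import List, Tuple

Run = Tuple[float, float]  # (seconds to first output, seconds to complete)


def middle_values(runs: List[Run]) -> Run:
    """Return the median first-output time and median completion time."""
    return (median(r[0] for r in runs), median(r[1] for r in runs))


# Invented stopwatch readings, three runs per product.
product_a = [(0.9, 9.2), (1.4, 8.7), (1.1, 9.5)]
product_b = [(7.8, 8.1), (8.4, 8.6), (8.0, 8.3)]

print("A:", middle_values(product_a))
print("B:", middle_values(product_b))
```

In this made-up data, product B finishes sooner on the completion clock yet has the worse first number, so B is the one whose waiting state you would redesign.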

Chapter Summary

Two numbers describe every wait: time to first token, which is how long before anything appears, and time to complete, which is how long before the answer is done.
Users judge speed mostly by the first number, because the blank moment before output is the only moment the product looks broken.
The classic thresholds still hold: about a tenth of a second reads as instant, about one second keeps the user in flow, and about ten seconds loses their attention.
Perceived speed is designable: streaming, progressive results, honest acknowledgment, and real progress beat a faster model behind a spinner.
Showing the actual work makes the service feel more valuable, even when the wait itself is identical.
Set a first-token target and a completion target per feature class, because a chat reply, a requested report, and a bulk job are waited on differently.
Keep the latency budget next to the cost target, because the same model choice moves both numbers at once.
Work nobody watches has no first-token clock, so let it be slow and cheap.
The model you route each request to sets both the price and the wait, which is where Routing and cascades: the cheapest model that passes your bar picks up.

Sources

Miller, R. B. (1968). Response Time in Man-Computer Conversational Transactions. Proceedings of the AFIPS Fall Joint Computer Conference.
Nielsen, J. (1993). Usability Engineering. Academic Press (the response-time limits chapter).
Buell, R. W., & Norton, M. I. (2011). The Labor Illusion: How Operational Transparency Increases Perceived Value. Management Science.
Anthropic and OpenAI streaming API documentation (last verified July 2026).