Batch and background: choose the realtime line · The Builder's Stack

Your product's weekly digest goes out at 9am sharp. Every user gets an AI-written summary of their week, generated that morning in one enormous burst through the realtime API, at realtime prices. The open logs tell the other half of the story: almost nobody reads the digest before lunch, and plenty read it the next day. The same job, submitted to your provider's batch tier the night before, would have finished hours before anyone woke up, cost on the order of half as much, and changed nothing a user could notice. Every week, the meter quietly records the gap between when the work was done and when anyone needed it. This chapter teaches you to see that gap in every AI job you run, and to collect the discount that lives inside it.

The realtime line: ask who is waiting

Every AI job your product runs has one question that sets its price floor before any model choice or prompt work: who is waiting for the result? A person watching the screen buys realtime inference, the most expensive way to purchase tokens, because the provider commits capacity to answer immediately. Every other job is negotiable, and most products never negotiate.

Price every job by who is waiting for it: a person watching buys realtime, and everything else can queue for the discount.

That one question sorts a product's AI work into three lanes.

Realtime: a person is watching. Chat replies, inline writing suggestions, code completions, a support bot mid-conversation. The user's attention is on the screen and every second is felt, so these jobs run on the realtime API and pay the full rate.
Background: a person asked and walked away. A research report that takes minutes to assemble, a meeting summary generated after the call ends, a long document translated overnight. The user needs a notification when it lands rather than a spinner while it runs, and how that wait should feel is the subject of Latency: design how fast it feels.
Batch: nobody asked at a particular moment. Weekly digests, re-tagging a support archive, refreshing embeddings after a catalog update, regenerating summaries across a document base. The deadline is measured in hours, which is exactly the window the batch tier sells.
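The three-lane sort can be written down as a tiny decision function. This is a minimal sketch, not a prescribed implementation: the `Lane` enum, the `classify` function, and the two boolean flags are hypothetical names chosen for illustration, and the example jobs are taken from the lanes above.

```python
from enum import Enum


class Lane(Enum):
    REALTIME = "realtime"
    BACKGROUND = "background"
    BATCH = "batch"


def classify(person_watching: bool, person_asked: bool) -> Lane:
    """Sort a job by who is waiting for the result."""
    if person_watching:
        # Attention is on the screen: pay the full realtime rate.
        return Lane.REALTIME
    if person_asked:
        # Someone asked and walked away: run it, then notify.
        return Lane.BACKGROUND
    # Nobody asked at a particular moment: it can queue for the discount.
    return Lane.BATCH


# (person_watching, person_asked) for a few jobs named in this chapter.
jobs = {
    "chat reply": (True, True),
    "meeting summary after the call": (False, True),
    "weekly digest": (False, False),
}
lanes = {name: classify(*flags) for name, flags in jobs.items()}
for name, lane in lanes.items():
    print(f"{name}: {lane.value}")
```

The order of the checks carries the rule: a watching person always wins, so an inline suggestion nobody explicitly requested still lands in the realtime lane.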

The opening scene is one job sorted into the wrong lane: a batch job running in the realtime lane. Nothing about the digest needed an answer immediately, and the product paid the immediacy premium anyway, every week, multiplied by the whole user base.

The batch tier: a discount for a deadline measured in hours

The batch tier is the simplest trade in this part. You submit a pile of requests in a single upload, the provider processes them when capacity is free, and the results come back within a stated window, typically hours and commonly capped at about a day. In exchange, the work bills at a discount on the order of half off realtime rates, because you are buying the provider's idle capacity instead of its committed capacity. Batch queues also usually run under their own, much larger rate limits, so a bulk job stops competing with your live traffic for throughput.

The current windows, discounts, and limits move often enough that they belong on your provider's pricing page and in this part's dated Price Sheet, not in prose. What stays stable is the eligibility test: a job qualifies for batch whenever its deadline is looser than the batch window. The discount applies to the job's whole bill of materials, meaning every call in the task you costed in "The bill of materials: cost the task, not the call", which is why moving one large job across the line often saves more than a month of prompt trimming.
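The eligibility test and the discount arithmetic fit in a few lines. This is a sketch under loudly stated assumptions: the 24-hour window, the 50% discount, and the price of 10 currency units per million tokens are placeholders, not any provider's real figures, and the digest's size (10,000 users, 3 calls each, 2,000 tokens per call) is invented for the example. Substitute the numbers from the Price Sheet.

```python
# Placeholder figures: replace with values from the dated Price Sheet.
BATCH_WINDOW_HOURS = 24.0        # assumption: "commonly capped at about a day"
BATCH_DISCOUNT = 0.5             # assumption: "on the order of half off"
PRICE_PER_MILLION_TOKENS = 10.0  # hypothetical realtime price, arbitrary units


def qualifies_for_batch(deadline_hours: float,
                        window_hours: float = BATCH_WINDOW_HOURS) -> bool:
    """A job qualifies when its deadline is looser than the batch window."""
    return deadline_hours > window_hours


def job_cost(tasks: int, calls_per_task: int, tokens_per_call: int,
             discount: float = 0.0) -> float:
    """Cost of the whole bill of materials: every call in every task."""
    total_tokens = tasks * calls_per_task * tokens_per_call
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS * (1 - discount)


# Hypothetical weekly digest: needed 36 hours from now, 10,000 users.
digest_ok = qualifies_for_batch(deadline_hours=36)
realtime_cost = job_cost(10_000, 3, 2_000)
batch_cost = job_cost(10_000, 3, 2_000, discount=BATCH_DISCOUNT)

print(f"qualifies: {digest_ok}")
print(f"realtime: {realtime_cost:.2f}  batch: {batch_cost:.2f}")
```

Because the discount multiplies the full token total, the saving scales with the size of the job rather than with the length of any one prompt.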

Queues and backpressure: what background work needs to run safely

Background and batch work does not simply run later; it needs somewhere to wait and rules for what happens while it waits. The decisions that cover it are ordinary infrastructure rather than anything AI-specific, and your engineers will recognize every one of them.

A queue. Jobs go into a line and workers drain the line. The queue is what lets a thousand requests arrive in one minute and complete over one hour without anything falling over.
A retry policy. Some calls will fail on rate limits, timeouts, or malformed outputs. Decide how many retries a job gets, how long to wait between them, and where a job goes when it keeps failing, so a bad input cannot loop forever at your expense.
A full-queue behavior. Backpressure, in plain words, is what the system does when jobs arrive faster than they finish: delay new work, drop the least important work, or degrade to a cheaper model until the line drains. Pick one before launch, because the alternative is the system picking for you during your best traffic day.
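The three decisions above can be sketched with Python's standard `queue` module. Everything named here is an illustrative assumption: the `Job` class, the limit of three attempts, the two-second base delay, and the choice of "delay new work" as the full-queue behavior. The `fake_model` function stands in for a real provider call.

```python
import queue
from dataclasses import dataclass

MAX_ATTEMPTS = 3    # assumption: retries allowed before a job is set aside
BASE_DELAY_S = 2.0  # assumption: first wait before retrying


@dataclass
class Job:
    name: str
    attempts: int = 0


def backoff_delay(attempt: int) -> float:
    """Exponential backoff: 2s, 4s, 8s for attempts 1, 2, 3."""
    return BASE_DELAY_S * 2 ** (attempt - 1)


def submit(q: queue.Queue, job: Job, deferred: list) -> bool:
    """Full-queue behavior chosen here: delay new work rather than drop it."""
    try:
        q.put_nowait(job)
        return True
    except queue.Full:
        deferred.append(job)
        return False


def drain(q: queue.Queue, call_model, dead_letter: list) -> list:
    """Run queued jobs; a job that keeps failing goes to dead_letter."""
    done = []
    while not q.empty():
        job = q.get_nowait()
        job.attempts += 1
        try:
            call_model(job)
            done.append(job.name)
        except RuntimeError:
            if job.attempts >= MAX_ATTEMPTS:
                dead_letter.append(job.name)  # stops the loop-forever case
            else:
                # A real worker would wait backoff_delay(job.attempts) here.
                q.put_nowait(job)
    return done


def fake_model(job: Job) -> None:
    """Stand-in for a provider call; one input always fails."""
    if job.name == "bad-input":
        raise RuntimeError("malformed output")


q = queue.Queue(maxsize=2)
deferred, dead = [], []
for name in ["digest-1", "bad-input", "digest-2"]:
    submit(q, Job(name), deferred)
done = drain(q, fake_model, dead)
print(done, dead, [j.name for j in deferred])
```

The queue holds two jobs, so the third is deferred instead of lost, and the failing job is retried up to the limit and then parked where a person can inspect it.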

The expensive new failure mode in this lane is the always-on agent. An agent that polls its inbox every minute all night is making realtime calls with no one waiting for any of them, which is the digest mistake running continuously instead of weekly. Idle work should wake on an event (a new email arrived, a document changed) or on a schedule sized to the job. The same discipline applies to your own builder fleets, which the chapter "Economics: what a fleet costs and when it pays" prices in detail.
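A quick count makes the polling cost concrete. This sketch assumes an eight-hour night and three overnight events, both invented for illustration, and assumes each poll and each wake-up costs one model call.

```python
def polling_calls(hours: float, interval_minutes: float) -> int:
    """Model calls made by an agent that checks on a fixed interval."""
    return int(hours * 60 / interval_minutes)


def event_driven_calls(events: list[str]) -> int:
    """Model calls made when the agent wakes only for real events."""
    return len(events)


# Hypothetical night: 8 hours, three things actually happened.
overnight_events = ["new email", "document changed", "new email"]
polled = polling_calls(hours=8, interval_minutes=1)
woken = event_driven_calls(overnight_events)

print(f"polling every minute: {polled} calls")
print(f"waking on events:     {woken} calls")
```

Under these assumptions the polling agent makes 480 calls to handle three pieces of real work, and the ratio only worsens as the inbox gets quieter.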

Launch math: plan for the day you cannot batch

The realtime line matters most on the day it stops being movable. A launch, a press mention, or a viral moment multiplies your traffic exactly on the realtime slice, because the spike is made of people watching screens, and no batch tier can help a chat reply. Capacity planning for AI features starts from that fact: estimate peak concurrent users, multiply by the tokens of a typical session, and check the total against your provider's realtime rate limits well before launch day, because raising limits usually takes a request and a wait.
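The capacity check is one multiplication and one comparison. The sketch below assumes the provider's limit is expressed in tokens per minute and that a session's tokens are spread evenly over its length; the user count, session size, session length, and limit are all hypothetical numbers to be replaced with your own estimates and your provider's published limit.

```python
def peak_tokens_per_minute(peak_users: int, tokens_per_session: int,
                           session_minutes: float) -> float:
    """Estimated realtime demand at peak, in tokens per minute."""
    return peak_users * tokens_per_session / session_minutes


# Hypothetical launch-day estimates.
demand = peak_tokens_per_minute(
    peak_users=2_000, tokens_per_session=6_000, session_minutes=5
)
rate_limit = 1_000_000  # assumption: current realtime limit, tokens per minute

fits = demand <= rate_limit
shortfall = max(0.0, demand - rate_limit)

print(f"demand: {demand:,.0f} tokens/min, limit: {rate_limit:,} tokens/min")
print("within limit" if fits else f"request an increase of {shortfall:,.0f}")
```

Running the check early matters more than its precision, since the result decides whether a limit-increase request has to go in before launch.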

Then shrink the spike before it arrives by pre-computing everything that can be pre-computed. Starter suggestions, onboarding examples, summaries of your existing catalog, and embeddings of your document base can all be generated on the batch tier in the quiet week before launch, so that launch traffic hits cached results instead of the meter. Move every scheduled batch job out of the launch window too, so your own digest run is not competing with your customers for the same rate limit. The realtime slice you cannot avoid should reach launch day as the only thing on the line.

Try it now

This drill spends no tokens and works on any product whose AI features you can enumerate, yours or one you know well.

List the jobs. Write down every AI job the product runs: each user-facing feature, each scheduled job, each agent or automation, one line apiece with roughly how often it runs.

Sort them across the line. Mark each job realtime, background, or batch using the one test of who is waiting: a person watching, a person who walked away, or nobody at a particular moment. Be suspicious of anything marked realtime that runs on a schedule, because schedules mean nobody is watching.

Find the biggest misfit. Pick the largest job running on the wrong side of the line, measured by monthly tokens or by run frequency times size if you have no dashboard.

Price the move. Using the batch figures in the Price Sheet, estimate what that job costs today at realtime rates and what it would cost on the batch tier, keeping both numbers to an order of magnitude.

Write the queue-and-notify design. In one short paragraph, specify the move so a user could never detect it: when the job is submitted, what queue holds it, the retry policy, the full-queue behavior, and what (if anything) notifies the user when results land. If the paragraph is easy to write, the move is a small ticket rather than a project, and the saving repeats every time the job runs.

Chapter Summary

Every AI job has a price floor set by one question: who is waiting for the result.
A person watching the screen buys realtime inference at full rate; a person who walked away buys background work with a notification; work nobody is waiting on at a particular moment is batch.
The batch tier trades a deadline measured in hours for a discount on the order of half, with the current windows and rates on provider pricing pages and the dated Price Sheet.
A job qualifies for batch when its deadline is looser than the batch window, and the discount applies to the task's entire bill of materials.
Background work needs a queue, a retry policy, and a decided full-queue behavior before launch day decides for you.
An always-on agent polling for work is realtime spend with nobody waiting; wake idle work on events or a schedule sized to how often real work arrives.
Launch spikes hit the realtime slice exactly when you cannot batch it, so check rate limits early and pre-compute everything the batch tier can prepare in advance.
Cutting what each answer costs is half the unit; what you charge for it is the other half, which is where the chapter "Pricing: charge in a currency your costs track" picks up.

Sources

Anthropic Message Batches API documentation and pricing (last verified July 2026).
OpenAI Batch API documentation and pricing (last verified July 2026).
Microsoft Azure Architecture Center. Queue-Based Load Leveling pattern (last verified July 2026).
This part's dated Price Sheet for current batch windows, discounts, and rate limits.
