AI-Native PM
8 min · 0 of 8 in The Unit Economics of AI

Routing and cascades: the cheapest model that passes your bar

Your product runs every request through the flagship model because that is what the demo ran, and the demo is what got the feature approved. Months after launch, someone finally pulls a week of production traffic for a routing review. Most of it turns out to be short, formulaic work: order status lookups, one-line rewrites, questions the help center already answers. You rerun a sample through a model priced at a small fraction of the flagship rate, set the answers side by side, and your own graders cannot tell them apart. The flagship was earning its rate on maybe a tenth of the volume, the long, tangled cases that genuinely need it, and on everything else you were paying a large multiple for an identical answer.

Match the model to the request, not to the demo

Live traffic is never uniform. A support bot fields "where is my order" in the same queue as a refund dispute that spans three policies, an email assistant fixes a typo in one message and drafts a delicate renewal negotiation in the next, and a research tool answers a definition lookup right before a synthesis question that needs many sources reconciled. Serving all of that with one model means either overpaying on the easy requests or underserving the hard ones, and defaulting to the flagship everywhere means overpaying.

If you worked through Economics: what a fleet costs and when it pays, you have made this match once already, putting cheap models on a fleet's simple jobs and the strongest model in the judge seat. Routing is the same match made for live product traffic, request by request, by machinery you design rather than a person assigning roles.

Route every request to the cheapest model that passes your quality bar, and let your evals, not your impressions, decide which model that is.

Three ways to build the router: rules, cascades, and retries

A router is whatever sits in front of your models and picks one per request. The industry has converged on three designs, and most production systems stack all three.

  • Rules before the request. A rule routes on properties you can read before any model runs: the request type, its length, the feature it came from, the customer's tier. Rules are the cheapest and most predictable design, since evaluating one costs nothing per request and the same input always takes the same path. Their limit is that they cannot read difficulty that lives inside the text, because a short request is not always an easy one.
  • Cascades. A cascade tries the small model first and escalates when the answer looks weak; "cascade" is the industry's word for exactly this pattern. The escalation trigger is a confidence gate: a threshold on a score attached to the answer that estimates how likely it is to hold up. The score is drawn from the model's own uncertainty signals or from a separate cheap checker, the same machinery as the judges in Graders: deterministic, judges, and humans. Cascades read real difficulty, because the small model failing is itself the evidence that the request is hard, but every escalated request gets billed twice.
  • Retry on failure. The safety net behind both. When an answer fails a hard check downstream (malformed output, a failed tool call, a grounding check that comes back empty), the product retries on a stronger model instead of returning the failure. Retries are not mainly a cost move; they exist to catch what the other two layers missed.

A typical stack uses rules for the coarse split (contract-analysis requests go straight to the flagship while the high-volume simple lane enters a cascade), the cascade to handle the bulk cheaply, and retries as the backstop under everything.
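As a rough illustration of how the three layers can stack, here is a minimal sketch. Every name in it (Request, Answer, route, the "contract_analysis" lane, the 0.8 gate) is a hypothetical stand-in rather than any provider's API, and the model functions are whatever callables your product already wraps.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical types and thresholds, for illustration only.

@dataclass
class Request:
    kind: str   # e.g. "order_status", "contract_analysis"
    text: str

@dataclass
class Answer:
    text: str
    confidence: float  # gate score in [0, 1]; higher = more likely to hold up
    model: str

ModelFn = Callable[[Request], Answer]

FLAGSHIP_ONLY_KINDS = {"contract_analysis"}  # assumed rule for a known-hard lane
CONFIDENCE_GATE = 0.8                        # assumed threshold; tune it on evals


def route(
    req: Request,
    small: ModelFn,
    flagship: ModelFn,
    hard_check: Callable[[Answer], bool],
) -> Answer:
    # Layer 1, rules: decide from properties readable before any model runs.
    if req.kind in FLAGSHIP_ONLY_KINDS:
        answer = flagship(req)
        used_flagship = True
    else:
        # Layer 2, cascade: small model first, escalate when the score is weak.
        answer = small(req)
        used_flagship = False
        if answer.confidence < CONFIDENCE_GATE:
            answer = flagship(req)
            used_flagship = True

    # Layer 3, retry: if a small-model answer fails a hard downstream check,
    # retry once on the stronger model. A flagship answer that fails is
    # returned as-is here; a real product would handle that case separately.
    if not used_flagship and not hard_check(answer):
        answer = flagship(req)
    return answer
```

The sketch keeps the rule check first because it costs nothing per request, and it places the retry after the cascade so it only catches what the first two layers let through.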

Let your evals decide which model serves which traffic

The dangerous phrase in any routing conversation is "the small model seems fine." It seemed fine on the requests someone happened to read, and a handful of requests nobody selected for coverage is too thin a sample to trust. The routing criterion is "passes your evals": the suite you built in The quality bar: decide what good means is what admits a model to a traffic lane, slice by slice. If the small model passes your eval set for order-status questions and fails your set for refund disputes, it has earned the first lane and not the second, and no demo impression overrides that.
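Slice-by-slice admission can be sketched in a few lines. The data shape (one slice name and one pass/fail grade per eval case), the function name, and the 0.95 bar are all assumptions for illustration, and the counts are made up; your own suite defines the real bar.

```python
from collections import defaultdict

# Hypothetical input: one (slice_name, passed) pair per graded eval case.
# The 0.95 bar is a placeholder, not a recommendation.

def admitted_lanes(results, bar=0.95):
    """Return {slice: admitted?} from per-case pass/fail grades."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for slice_name, ok in results:
        total[slice_name] += 1
        passed[slice_name] += int(ok)
    return {s: passed[s] / total[s] >= bar for s in total}


# Made-up grades for the small model on two slices.
results = (
    [("order_status", True)] * 39 + [("order_status", False)] * 1        # 39 of 40 pass
    + [("refund_dispute", True)] * 28 + [("refund_dispute", False)] * 12  # 28 of 40 pass
)
print(admitted_lanes(results))
# {'order_status': True, 'refund_dispute': False}
```

The point of keeping the result per slice is that a model can earn one lane while being refused another, which a single overall pass rate would hide.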

The same discipline covers changes. Rerouting a slice to a cheaper model is a model swap for every user in that slice, so it ships through The regression gate: no change ships blind like any other change to the product's behavior.

Cascades save money only while escalation stays rare

An escalated request costs the small call plus the flagship call, so it bills slightly more than sending it straight to the flagship, and it also waits through two models in sequence. The whole economic case for a cascade therefore rests on how often escalation happens.

When the small model clears the large majority of traffic, the blended cost falls toward the small-model rate, which is the saving the team in the opening scene was leaving on the table. When escalation is frequent, you converge on paying flagship rates plus a small-model surcharge on everything, with the extra wait on top, and that is a worse deal than no cascade at all. The exact ratio between model tiers moves with provider pricing, and the Price Sheet keeps the current numbers. The stable fact is that small models run at a small fraction of flagship rates, so the arithmetic favors cascades whenever easy traffic dominates. A routing review like the one in the opening scene is how you find out whether yours does.
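The arithmetic can be written out under simplifying assumptions: every request costs c_s on the small model and c_f on the flagship (comparable token counts for both calls), and a fraction e of requests escalates. These symbols are introduced here for illustration and do not come from the Price Sheet.

```latex
\begin{align*}
C_{\text{cascade}} &= c_s + e\,c_f
  && \text{(every request pays the small call; a share } e \text{ also pays the flagship)} \\
C_{\text{cascade}} < c_f
  &\iff e < 1 - \frac{c_s}{c_f}
  && \text{(break-even escalation rate)}
\end{align*}
```

With a purely illustrative ratio of c_s / c_f = 0.1, the cascade is cheaper on cost alone for any escalation rate below 0.9, and at e = 0.2 it bills 0.1 c_f + 0.2 c_f = 0.3 c_f per request on average. The inequality covers cost only; the extra wait on escalated requests is a separate price that it does not capture.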

A cascade pays twice on every escalated request, so the escalation rate decides whether a cascade saves money, and you measure it before you celebrate.

The escalation rate stays useful long after launch. A climbing rate means something changed: the traffic mix shifted toward harder requests, a prompt change weakened the small lane, or the confidence gate drifted. Keep it on the same dashboard as your cost per task, because if you never re-measure a cascade, you are only guessing that it still saves money.
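One way to keep that number visible is a rolling counter that feeds your dashboard. The class below is a minimal sketch; its name, the window size, and the 0.25 alert threshold are assumptions to replace with values from your own measurements.

```python
from collections import deque


class EscalationMonitor:
    """Rolling escalation rate over the last `window` cascade requests."""

    def __init__(self, window=1000, alert_above=0.25):
        self._events = deque(maxlen=window)  # True means the request escalated
        self.alert_above = alert_above       # assumed threshold, not a standard

    def record(self, escalated: bool) -> None:
        self._events.append(escalated)

    @property
    def rate(self) -> float:
        if not self._events:
            return 0.0
        return sum(self._events) / len(self._events)

    def should_alert(self) -> bool:
        # Alert only on a full window so a few early requests cannot trip it.
        window_full = len(self._events) == self._events.maxlen
        return window_full and self.rate > self.alert_above
```

A rolling window is used rather than an all-time average so that a recent shift in traffic mix or a weakened prompt shows up quickly instead of being diluted by months of history.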

Try it now

This drill builds a two-model cascade against your own eval cases, on paper or in a notebook, and it produces the one number the chapter turns on: your escalation rate.

Get your cases. Pull thirty cases from your existing eval suite, the set your quality bar is defined against.

Run the small lane. Send all thirty through a small model, the cheapest tier your provider offers.

Judge every answer. Grade each answer against your bar, pass or fail, using the graders you already have: deterministic checks where they exist, a judge pass where they do not.

Escalate the failures. Send only the failing cases to your flagship and grade those answers too.

Do the arithmetic. Write down the cascade's quality (cases passing after escalation) and its cost (thirty small calls plus the escalated flagship calls), then put both next to flagship-everywhere (thirty flagship calls, graded once if you want the quality comparison). The share of cases that escalated is your escalation rate, and it tells you whether a cascade belongs in your product.

Scale it down: run ten cases instead of thirty and use a single judge pass for all the grading.
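If you run the drill in a notebook, the final arithmetic step might look like the sketch below. The function name and the per-call costs are hypothetical, and the example grades are made up; plug in your own grades and the rates from the Price Sheet.

```python
def cascade_report(small_pass, flagship_pass_on_escalated, small_cost, flagship_cost):
    """Summarize the drill.

    small_pass: list of bools, one per case (small-model grade).
    flagship_pass_on_escalated: list of bools, one per case the small model failed.
    small_cost, flagship_cost: assumed average cost per call, in the same unit.
    """
    n = len(small_pass)
    escalated = small_pass.count(False)
    if len(flagship_pass_on_escalated) != escalated:
        raise ValueError("need exactly one flagship grade per escalated case")

    passing = small_pass.count(True) + flagship_pass_on_escalated.count(True)
    return {
        "escalation_rate": escalated / n,
        "cascade_pass_rate": passing / n,
        # Every case pays the small call; escalated cases also pay the flagship.
        "cascade_cost": n * small_cost + escalated * flagship_cost,
        "flagship_only_cost": n * flagship_cost,
    }


# Made-up example: 30 cases, 6 fail on the small model, 5 of those pass on the flagship.
report = cascade_report(
    small_pass=[True] * 24 + [False] * 6,
    flagship_pass_on_escalated=[True] * 5 + [False],
    small_cost=1,       # illustrative units, not real prices
    flagship_cost=10,
)
print(report["cascade_cost"], report["flagship_only_cost"])  # 90 300
```

Note that the drill escalates on the grader's verdict, which is a more reliable trigger than a production confidence gate is likely to be, so treat the escalation rate it reports as an estimate rather than a forecast.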

Chapter Summary

  • Live traffic mixes easy and hard requests, so one model for everything either overpays or underserves, and the flagship-everywhere default overpays.
  • Route every request to the cheapest model that passes your quality bar, the same match you made for fleet roles, now made per request on live traffic.
  • Rules route on what is visible before any model runs, and they are the cheapest and most predictable design.
  • Cascades try the small model first and escalate when a confidence gate flags the answer as weak, which reads real difficulty but bills escalated requests twice.
  • Retry on failure is the safety net that catches what rules and cascades miss.
  • A model is admitted to a traffic lane by passing that lane's evals, never by seeming fine on a handful of examples.
  • Every routing change is a model swap for the rerouted slice, so it ships through the regression gate.
  • Measure the escalation rate before crediting a cascade with savings, and keep watching it after launch.
  • Routing lowers the price of the requests you send, and Caching: never pay for the same thinking twice goes after the requests you never needed to send at all.

Sources

  • Chen, L., Zaharia, M., & Zou, J. (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv.
  • Ding, D., Mallick, A., Wang, C., et al. (2024). Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing. ICLR 2024.
  • Ong, I., Almahairi, A., Wu, V., et al. (2024). RouteLLM: Learning to Route LLMs with Preference Data. arXiv.
  • Anthropic, OpenAI, and Google model tier documentation and pricing pages (last verified July 2026).
  • The dated Price Sheet for current per-tier rates.