Your feature works, and tonight it meets its first outside audience: the friend it was built for, phone in hand at your kitchen counter, feeding it her own real input. The first answer is good and she says so. The second takes nine seconds, long enough that she asks whether the app froze, and while you wait you open the provider dashboard to a number you have never checked: 412 calls this month, every one on the flagship you accepted as a default the day you made your first call. She asks what each answer costs. You do not know, and worse, you cannot say why this model, because you never chose it.
Part IV starts here. From Part III you carry a feature that works: an engineered prompt, a schema your code can parse, facts fetched at call time, and a tool the model can call, all on a strong default. That was right, since prompt and model problems look identical until the prompt is solid. Today the default earns its place with evidence or loses it to a cheaper tier.
The five axes, aimed at one real feature
A model choice comes down to five axes, and we will aim each one at FuelTheFam's fridge feature: the user snaps a fridge photo, it goes server-side to a multimodal model, and a shopping list comes back at the open door.
- Modality. Can the model accept your input at all? The fridge feature needs image input, which removes much of the field before any ranking matters; ask it first.
- Capability. Good enough at your task, not tasks in general. Naming groceries in a photo takes no deep reasoning, so reasoning depth is capability this feature never uses.
- Cost. You pay per token on every call: a fraction of a cent per list at current mid-tier rates, several times that on a flagship, multiplied by every user, every day.
- Latency. The wait, measured where your user stands: a parent at the open fridge counts seconds, and the app's own budget is that a full day of meals logs in under three minutes.
- Data handling. Where the input goes. A fridge photo shows the inside of someone's home, so we turned off the provider's retention and training options; run that check on whatever your feature accepts.
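The cost axis is arithmetic worth doing once for your own feature. The sketch below is a minimal illustration, not a price sheet: the tier names, per-million-token prices, token counts, and daily call volume are all hypothetical placeholders, to be replaced with the figures from your provider's current pricing page and your own logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    usd_per_million_input: float   # hypothetical price, not a real rate
    usd_per_million_output: float  # hypothetical price, not a real rate


def cost_per_call(tier: Tier, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call: tokens times price, summed over both directions."""
    return (
        input_tokens * tier.usd_per_million_input
        + output_tokens * tier.usd_per_million_output
    ) / 1_000_000


def monthly_cost(tier: Tier, input_tokens: int, output_tokens: int,
                 calls_per_day: int, days: int = 30) -> float:
    """Per-call cost multiplied by every call, every day."""
    return cost_per_call(tier, input_tokens, output_tokens) * calls_per_day * days


# Placeholder tiers and volumes, for illustration only.
flagship = Tier("hypothetical-flagship", 10.0, 40.0)
mid_tier = Tier("hypothetical-mid-tier", 1.0, 4.0)

for tier in (flagship, mid_tier):
    per_call = cost_per_call(tier, input_tokens=1_500, output_tokens=200)
    per_month = monthly_cost(tier, 1_500, 200, calls_per_day=500)
    print(f"{tier.name}: ${per_call:.4f} per call, ${per_month:.2f} per month")
```

Image inputs are billed as tokens too, and how a photo converts to tokens varies by provider, so the input count here is an assumption to check against a real response's usage report.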
The public boards help with one axis, capability. They come in a few durable categories: arenas ranking models by blind human preference, indexes plotting quality against speed and price, and evaluations on private question sets no vendor can prepare for. No benchmark names appear here because they go stale: any benchmark famous enough to become shorthand gets saturated within quarters, and then it separates nothing. Shortlist two or three candidates through the Playbook's tool selection section and stop there; no board has ever run your inputs.
The real decision is between tiers
What survives the modality filter sorts into tiers: every major provider sells a flagship priced for the hardest work and a mid-tier that answers faster, costs a fraction of the flagship rate, and handles more than last year's flagship. For a defined feature, the live question is not which flagship tops this month's board but whether the two tiers differ on your inputs at all.
The Proof Table: five photos, two tiers, verdicts
We ran that comparison with five real fridge photos from our own phones: a typical case, a crowded case, a degraded case, a near-empty case, and one with a non-food object in the frame. Before any run we wrote the pass bar: a parent could shop from the list without reopening the fridge, and nothing on it is absent from the photo. An invented item fails outright, because nobody on our team reads a list before the user sees it. Each photo went through both tiers under the production prompt (return only items visible in the photo) and the production schema, one items field, a list of strings.
| Photo | Flagship | Mid-tier | Verdict |
|---|---|---|---|
| Typical: twelve items visible | All twelve | Eleven; missed the butter behind the milk | Both pass |
| Crowded: door shelf, nineteen condiments | All nineteen, both jams by flavor | Eighteen; merged the jams into "jam" | Both pass |
| Blurry: taken as the door swung shut | Six items, two not in the photo | Nine items, three not in the photo, no hedging | Both fail |
| Near-empty: three items and baking soda | All four, nothing added | All four, nothing added | Both pass |
| Off-domain: a toy dinosaur beside the yogurt | Food only | Food only | Both pass |
The mid-tier passed four of five, the same four the flagship passed, and the only difference that mattered appeared on the row both failed. On readable photos the whole gap was one hidden butter and two merged jams, differences that change no shopping trip. On the blurry photo the visible-items instruction lost to the blur on both tiers, and the cheaper model failed more dangerously: nine unhedged items that read like a good list. The fix is not a model: when a photo is unreadable, the feature asks for a retake instead of guessing.
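How a feature decides that a photo is unreadable is a design choice the run above leaves open. One common heuristic, offered here as an assumption and not as the check FuelTheFam ships, is the variance of a Laplacian filter over the grayscale image: few sharp edges give a low variance. The function names and the threshold below are hypothetical, and the threshold only means something after tuning on your own sharp and blurry photos.

```python
from statistics import pvariance

# Hypothetical cutoff; tune it on your own sharp and blurry photos.
SHARPNESS_THRESHOLD = 100.0


def laplacian_variance(gray: list[list[float]]) -> float:
    """Variance of a 4-neighbor Laplacian over the interior pixels.

    `gray` is a grid of grayscale values, one row per list.
    Low values mean few sharp edges, a common sign of blur.
    """
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, len(gray) - 1)
        for x in range(1, len(gray[0]) - 1)
    ]
    # Images too small to have interior pixels count as unreadable.
    return pvariance(responses) if responses else 0.0


def needs_retake(gray: list[list[float]],
                 threshold: float = SHARPNESS_THRESHOLD) -> bool:
    """True when the photo looks too soft to send to the model."""
    return laplacian_variance(gray) < threshold
```

Running the gate before the model call means an unreadable photo costs no tokens; the blurry row of your own table is the first test case for whatever threshold you pick.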
A Proof Table is five real inputs, two models, and a verdict you can argue in prose, judged against a bar written before the run and rerun after every change to prompt, context, or model.
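Part of the pass bar can be checked mechanically before a human writes the verdict. The sketch below assumes each model returns JSON matching the schema described above (one `items` field, a list of strings) and that you have hand-labeled what is visible in each photo; the function names and the returned fields are hypothetical. Matching is by exact lowercase string, so an unmatched item is a prompt for a human look, not an automatic failure: "jam" against a labeled "strawberry jam" would be flagged even though a shopper would accept it.

```python
import json


def parse_items(raw: str) -> list[str]:
    """Parse the schema: one `items` field holding a list of strings."""
    data = json.loads(raw)
    items = data["items"]
    if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
        raise ValueError("items must be a list of strings")
    return [i.strip().lower() for i in items]


def check_row(raw_output: str, visible: set[str]) -> dict:
    """Compare one model output with the hand-labeled contents of one photo.

    `unmatched` lists returned items with no exact label match (possible
    inventions); `missed` lists labeled items the model left out.
    """
    returned = set(parse_items(raw_output))
    return {
        "unmatched": sorted(returned - visible),
        "missed": sorted(visible - returned),
        "needs_review": bool(returned - visible),
    }


# Made-up example row, not data from the run described in this chapter.
labels = {"milk", "eggs", "butter"}
result = check_row('{"items": ["Milk", "eggs", "ketchup"]}', labels)
print(result)
```

The second half of the bar, whether a parent could shop from the list without reopening the fridge, stays a human judgment; the check only narrows where to look.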
The obvious objection is that five photos prove nothing, and as statistics that is right: the boards run thousands of cases against our five. But the two measurements answer different questions. The boards established that both tiers handle photos at all; the table established what our feature does with our inputs, and its value was finding which input fails and how, not the pass rate. No leaderboard has a category for photos taken as a fridge door swings shut. The rigorous version, dozens of cases scored automatically, is an eval, and The Practice builds yours; a five-input run fits in an afternoon.
Ship the cheapest tier that passes your own five inputs, and read every failing row twice, because no model tier fixes an input problem.
Ship the choice: one config line and the first Model Spec entry
Switching costs one line if the model name lives in a single configuration value instead of repeated through the code. Ours is this diff:
- FRIDGE_MODEL = "provider-flagship-latest"
+ FRIDGE_MODEL = "provider-mid-tier-latest"
(The live values are the provider's current model strings; they rot too fast to print.) Fine-tuning, adjusting the model's weights on your own examples, stays the last rung of the ladder from Give the model tools to fetch and act, and it is almost never your move; nothing this feature needed sat beyond a better prompt, fresher context, or a retake rule.
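The one-line switch works only if every call site reads the model name from the same place. A minimal sketch of that arrangement, assuming an environment variable as the override and a request shape whose field names are illustrative and not any provider's actual API:

```python
import os

# Placeholder mirroring the diff above; not a real model string.
DEFAULT_FRIDGE_MODEL = "provider-mid-tier-latest"


def fridge_model() -> str:
    """Return the model name from one place: env override, else the default."""
    return os.environ.get("FRIDGE_MODEL", DEFAULT_FRIDGE_MODEL)


def build_request(image_b64: str) -> dict:
    """Assemble a provider-agnostic request; the field names are hypothetical."""
    return {"model": fridge_model(), "image": image_b64}
```

With this in place, a Proof Table rerun against another tier means setting `FRIDGE_MODEL` for that run and leaving the code untouched.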
The choice also belongs somewhere more durable than a config line: the Model Spec is the one-page document Part IV fills a section at a time, signed when you plan the build and ship it. Ours starts with its model section:
Model: current mid-tier multimodal (name pinned in config)
Chosen over: the same provider's flagship
Evidence: Proof Table v1, July 2026; passed 4 of 5
Why: no gap on readable photos; a fraction of the cost and wait
Known failure: both tiers invent items on blurry photos; the retake rule catches it
Revisit when: a failing row appears that a retake cannot fix
Copy the Model Spec starter; the chapters ahead earn the rest.
Try it now
About thirty minutes; the first path spends nothing, the second pennies. You arrive with Part III's result: a working feature on your default model, with prompt, schema, facts, and tool. You leave with your Proof Table v1, your chosen model, and the Model Spec's model section filled.
No setup: Copy the Proof Table template into any document and build everything except the outputs. Collect five real inputs from the user your one-pager describes, one per kind: typical, crowded, degraded, near-empty, off-domain. Write a one-sentence pass bar for each row before any model runs, then mark which rows you predict a cheaper tier would fail.
With your tools: Ask Claude Code to pull your model name into one config value, then run your five inputs through your default and the same provider's cheaper tier, pasting both outputs into the table. Judge each row against the bar, write one-sentence verdicts, and set the config line to the cheapest tier that passed. Then fill the Model Spec's model section: the choice, the evidence, the revisit trigger. Same move in Codex or Cursor: one config value, five inputs, two tiers, verdicts. If nothing is installed yet, the Setup Clinic gets you running.
Chapter Summary
- You arrive with a feature that works on a strong default; this chapter decides, on evidence, whether that default is the model you ship.
- Five axes decide the choice: modality, capability at your specific task, cost per call, latency where your user stands, and data handling.
- Use the public boards only to shortlist: benchmark names saturate within quarters, and no board has ever run your inputs.
- The real decision is usually between one provider's tiers, and today's mid-tier handles more than last year's flagship did.
- A Proof Table is five real inputs, two models, and argued verdicts against a bar written before the run; rerun it after every change.
- Our run surprised us: the mid-tier passed four of five, both tiers failed the blurry photo, and the fix was the retake rule, not the bigger model.
- Keep the model name in one config value so switching stays a one-line change.
- Fine-tuning is the last rung of the customization ladder and is almost never your move.
- Record the choice, the evidence, and the revisit trigger in the Model Spec, the document this part fills and signs at ship.
Next, Wire the model into your build turns the chosen model into running code in your own repo.
Sources
- OpenAI, Anthropic, and Google model documentation and pricing pages (last verified July 2026).
- Public model leaderboards in the categories described in this chapter; current links are maintained in the Playbook's tool selection section (last verified July 2026).
- FuelTheFam fridge-feature Proof Table run, our own build records, July 2026.