name: cost-quality-tradeoff
description: |
  Measure and optimize the cost/quality curve: which model, prompt, and settings give the best quality per dollar. Covers Pareto analysis, break-even thresholds, and when to spend more vs less. Use this skill when optimizing LLM spend, picking a default model for a feature, or deciding whether a premium model is worth it. Activate when: cost vs quality, model selection, eval cost, Pareto frontier, cheaper model, premium model tradeoff.
Cost vs Quality Tradeoff
Quality without cost context is half a decision. You need the Pareto frontier — for each quality bar, what's the cheapest config that hits it?
When to Use
- Choosing a default model for a new feature
- Reducing LLM spend on an existing feature
- Justifying (or not) an upgrade to a premium model
- Trading off prompt complexity, model size, and thinking budget
The Pareto Frontier
Plot each candidate config (model × prompt × settings) on quality (y-axis) vs cost per request (x-axis). The frontier is the set of configs for which no other config is at least as cheap and at least as good while being strictly better on one of the two.
Any config NOT on the frontier is dominated: some other option matches or beats it on both cost and quality. Drop it.
quality
  ↑
1 |                        *A (opus + thinking)
  |                  *B (opus)
  |            *D (sonnet + few-shot)
  |      *C (sonnet)
  |  *E (haiku)
  |  *F        *G
0 +--------------------------→ cost
Pareto: A, B, D, C, E. Dominated: F (worse than E at same cost), G (worse than D at same cost).
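The dominance check can be sketched in a few lines. This is a minimal illustration, assuming each config has already been reduced to one cost figure and one quality score; the `ConfigPoint` shape and all numbers below are made up for the example and are not measurements.

```typescript
// Hypothetical shape for one measured config; field names are illustrative.
interface ConfigPoint {
  name: string;
  costPerRequest: number; // USD, lower is better
  quality: number;        // 0..1, higher is better
}

// a dominates b if a is no worse on both axes and strictly better on at least one.
function dominates(a: ConfigPoint, b: ConfigPoint): boolean {
  const noWorse = a.costPerRequest <= b.costPerRequest && a.quality >= b.quality;
  const strictlyBetter = a.costPerRequest < b.costPerRequest || a.quality > b.quality;
  return noWorse && strictlyBetter;
}

// O(n^2) scan, which is fine for the handful of configs in a typical comparison.
function paretoFrontier(points: ConfigPoint[]): ConfigPoint[] {
  return points
    .filter((p) => !points.some((q) => dominates(q, p)))
    .sort((a, b) => a.costPerRequest - b.costPerRequest);
}

// Made-up numbers that mirror the labels in the plot.
const candidates: ConfigPoint[] = [
  { name: "A", costPerRequest: 0.06, quality: 0.95 },
  { name: "B", costPerRequest: 0.04, quality: 0.9 },
  { name: "D", costPerRequest: 0.015, quality: 0.85 },
  { name: "C", costPerRequest: 0.01, quality: 0.8 },
  { name: "E", costPerRequest: 0.002, quality: 0.7 },
  { name: "F", costPerRequest: 0.002, quality: 0.6 },  // same cost as E, lower quality
  { name: "G", costPerRequest: 0.015, quality: 0.65 }, // same cost as D, lower quality
];

const frontierNames = paretoFrontier(candidates).map((p) => p.name);
```

With these inputs, F and G are filtered out and the remaining configs come back ordered from cheapest to most expensive.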
Measurement
For each candidate, measure:
| Metric | Example |
|---|---|
| Input tokens / request | 2,500 |
| Output tokens / request | 400 |
| $ / request | $0.012 |
| Quality score | 0.87 |
| p95 latency | 1.8s |
// Rates are in USD per million tokens. Cache fields may be absent, so default them to 0.
const costPerRequest =
  (usage.input_tokens / 1e6) * inputRate +
  (usage.output_tokens / 1e6) * outputRate +
  ((usage.cache_creation_input_tokens ?? 0) / 1e6) * cacheWriteRate +
  ((usage.cache_read_input_tokens ?? 0) / 1e6) * cacheReadRate;
Always include cache costs — they dominate on cached workloads.
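To get the "$ / request" figure for a whole eval run, the same formula can be averaged over every recorded usage object. The sketch below assumes usage records with the token-count fields used above; the `Rates` shape and the rate values are placeholders, not real prices.

```typescript
// Hypothetical per-million-token rates in USD; substitute current pricing.
interface Rates {
  input: number;
  output: number;
  cacheWrite: number;
  cacheRead: number;
}

// Token counts for one request; the cache fields may be missing or null.
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

// Cost of a token count at a per-million-token rate, treating missing counts as 0.
const tokenCost = (tokens: number | null | undefined, ratePerMillion: number): number =>
  ((tokens ?? 0) / 1e6) * ratePerMillion;

function requestCost(usage: Usage, rates: Rates): number {
  return (
    tokenCost(usage.input_tokens, rates.input) +
    tokenCost(usage.output_tokens, rates.output) +
    tokenCost(usage.cache_creation_input_tokens, rates.cacheWrite) +
    tokenCost(usage.cache_read_input_tokens, rates.cacheRead)
  );
}

// Mean cost per request over an eval run; returns 0 for an empty run.
function meanCostPerRequest(usages: Usage[], rates: Rates): number {
  if (usages.length === 0) return 0;
  return usages.reduce((sum, u) => sum + requestCost(u, rates), 0) / usages.length;
}

// Placeholder rates and two made-up requests, one of them mostly served from cache.
const exampleRates: Rates = { input: 3, output: 15, cacheWrite: 3.75, cacheRead: 0.3 };
const sampleUsages: Usage[] = [
  { input_tokens: 2500, output_tokens: 400 },
  { input_tokens: 500, output_tokens: 400, cache_read_input_tokens: 2000 },
];
const avgCost = meanCostPerRequest(sampleUsages, exampleRates);
```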
Common Configs to Compare
For any feature, try at least:
- Haiku with concise prompt
- Haiku with longer / few-shot prompt
- Sonnet with concise prompt
- Sonnet with few-shot + structured output
- Sonnet with extended thinking
- Opus with concise prompt
- Opus with extended thinking
One of these usually sits on the frontier for your workload. Don't assume — measure.
Prompt as a Lever
Before jumping to a bigger model, try prompt levers:
- Few-shot examples (2-5) often bump quality 5-15% at small cost
- Structured output (JSON schema) reduces parsing errors
- Chain-of-thought prompting helps reasoning without thinking tokens
- Better system prompt scoping (what's in vs out) improves accuracy
A better prompt on Haiku can beat a mediocre prompt on Sonnet and still cost several times less per request; check the exact ratio against current pricing.
Break-Even Analysis
When considering an upgrade, compute when it pays off:
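Cost increase per request: Δcost = new - old (in $ per request)
Quality increase: Δquality = new - old (in quality points, e.g. percentage points on your eval)
Value per quality point: V (in $ per request per point, estimated from business metrics)
Worth it if: Δquality × V > Δcost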
Example: If every 1% quality gain increases user retention revenue by $0.003/request, and upgrading Haiku→Sonnet costs +$0.002/request for +5% quality:
- Value gain: 5 × $0.003 = $0.015
- Cost: $0.002
- Net: +$0.013 per request → upgrade.
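The same arithmetic can be wrapped in a small helper. This is a sketch: the `Upgrade` shape is hypothetical, and the absolute costs and quality scores in the example are invented so that the deltas match the worked example (+$0.002 per request, +5 points).

```typescript
// Hypothetical description of an upgrade between two configs.
interface Upgrade {
  oldCost: number;    // USD per request
  newCost: number;    // USD per request
  oldQuality: number; // quality points (e.g. percent correct on the eval)
  newQuality: number;
}

// Net value per request: Δquality × V − Δcost. Positive means the upgrade pays off.
function netValuePerRequest(u: Upgrade, valuePerPoint: number): number {
  const deltaCost = u.newCost - u.oldCost;
  const deltaQuality = u.newQuality - u.oldQuality;
  return deltaQuality * valuePerPoint - deltaCost;
}

// Value per point at which the upgrade exactly breaks even.
// Returns Infinity when quality does not improve.
function breakEvenValuePerPoint(u: Upgrade): number {
  const deltaQuality = u.newQuality - u.oldQuality;
  if (deltaQuality <= 0) return Infinity;
  return (u.newCost - u.oldCost) / deltaQuality;
}

// Invented absolutes; only the deltas (+$0.002, +5 points) come from the example.
const haikuToSonnet: Upgrade = { oldCost: 0.001, newCost: 0.003, oldQuality: 80, newQuality: 85 };
const net = netValuePerRequest(haikuToSonnet, 0.003);
const threshold = breakEvenValuePerPoint(haikuToSonnet);
```

Here `net` comes out at about $0.013 per request, and `threshold` shows the upgrade would still break even if a quality point were worth only about $0.0004 per request.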
Tiered Routing
You don't have to pick one. Route by difficulty:
const difficulty = await classifyDifficulty(query);
const model = difficulty === "simple" ? "claude-haiku-4-5"
: difficulty === "medium" ? "claude-sonnet-4-6"
: "claude-opus-4-6";
Classification is a cheap Haiku call. If most queries are simple, most traffic runs on the cheapest model while hard queries still get the premium treatment.
Measure: does tiered routing actually improve your cost/quality position? Sometimes classification errors wipe out the gains.
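One way to run that check, sketched under the assumption that every eval item has been scored under each model and that the router's choice was recorded per item. The `RoutedItem` shape, the short model labels, and all numbers are hypothetical.

```typescript
type Model = "haiku" | "sonnet" | "opus";

// One eval item scored under every model, plus the router's choice for it.
interface RoutedItem {
  routedTo: Model;
  cost: Record<Model, number>;  // USD per request under each model
  score: Record<Model, number>; // quality score under each model
}

interface Position {
  cost: number;
  quality: number;
}

// Arithmetic mean; callers are expected to pass a non-empty array.
const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

// Cost/quality position if every request went to one model.
function singleModelPosition(items: RoutedItem[], model: Model): Position {
  return {
    cost: mean(items.map((i) => i.cost[model])),
    quality: mean(items.map((i) => i.score[model])),
  };
}

// Position of the router; the classifier call is paid on every request.
function routedPosition(items: RoutedItem[], classifierCost: number): Position {
  return {
    cost: mean(items.map((i) => i.cost[i.routedTo])) + classifierCost,
    quality: mean(items.map((i) => i.score[i.routedTo])),
  };
}

// Made-up eval items: one easy query sent to haiku, one hard query sent to opus.
const perModelCost: Record<Model, number> = { haiku: 0.001, sonnet: 0.004, opus: 0.02 };
const evalItems: RoutedItem[] = [
  { routedTo: "haiku", cost: perModelCost, score: { haiku: 1, sonnet: 1, opus: 1 } },
  { routedTo: "opus", cost: perModelCost, score: { haiku: 0, sonnet: 0.5, opus: 1 } },
];

const router = routedPosition(evalItems, 0.0005);
const allOpus = singleModelPosition(evalItems, "opus");
```

Treat the router's position as one more candidate point: if a single-model config matches its quality at lower cost, the routing layer is dominated and not worth keeping.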
Latency as a Third Axis
Cost-quality isn't enough; latency matters too. Examples where it dominates:
- Chat UX needs first token < 2s → rules out Opus with extended thinking
- Voice agent needs full response < 500ms → forces Haiku
- Background summarization: latency doesn't matter, optimize cost/quality only
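Report 3-tuples: (quality, cost, p95 latency). Adding an axis usually enlarges the frontier, because a config dominated on cost and quality can still be the fastest; apply any hard constraint first (for example a latency ceiling), then choose among the configs that remain.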
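A constraint-first selection can be sketched as a filter followed by a cost sort. The `Measured` shape, the config names, and every number here are illustrative assumptions.

```typescript
// Hypothetical 3-tuple record for one config.
interface Measured {
  name: string;
  quality: number;        // 0..1
  costPerRequest: number; // USD
  p95LatencyMs: number;
}

// Apply the hard constraints first, then take the cheapest survivor.
// Returns undefined when no config satisfies both constraints.
function cheapestWithin(
  configs: Measured[],
  minQuality: number,
  maxP95Ms: number,
): Measured | undefined {
  return configs
    .filter((c) => c.quality >= minQuality && c.p95LatencyMs <= maxP95Ms)
    .sort((a, b) => a.costPerRequest - b.costPerRequest)[0];
}

// Made-up measurements.
const measured: Measured[] = [
  { name: "haiku-concise", quality: 0.78, costPerRequest: 0.002, p95LatencyMs: 400 },
  { name: "sonnet-few-shot", quality: 0.87, costPerRequest: 0.012, p95LatencyMs: 1800 },
  { name: "opus-thinking", quality: 0.95, costPerRequest: 0.08, p95LatencyMs: 9000 },
];

const chatPick = cheapestWithin(measured, 0.85, 2000);
const strictPick = cheapestWithin(measured, 0.9, 2000);
```

An `undefined` result, as with `strictPick` here, means the quality bar and the latency ceiling cannot both be met by any measured config, so one of the two has to be relaxed.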
Cache-Aware Selection
If you can cache 90% of your input:
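- Effective input cost drops roughly 5× when cache reads are billed at about a tenth of the base input rate (0.1 + 0.9 × 0.1 = 0.19 of the uncached cost), before cache-write surcharges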
- Long-context + cache often beats short-context + no cache at the same quality
- Larger models become more affordable per request
Decisions made without caching factored in are usually wrong. Re-measure with cache.
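A rough way to re-measure on paper is to express input cost as a fraction of the uncached cost. The sketch below assumes cache reads are billed at 0.1× and cache writes at 1.25× the base input rate; both multipliers are assumptions to verify against current pricing, and `hitsPerWrite` is a hypothetical parameter for how many cache reads follow each write.

```typescript
// Input cost relative to sending everything uncached (1.0 = no saving).
// cachedShare: fraction of input tokens that sit in the cacheable prefix (0..1).
// hitsPerWrite: number of cache reads served per cache write.
// readMultiplier / writeMultiplier: ASSUMED prices relative to the base input rate.
function effectiveInputFraction(
  cachedShare: number,
  hitsPerWrite: number,
  readMultiplier = 0.1,
  writeMultiplier = 1.25,
): number {
  // Out of every (hitsPerWrite + 1) requests, one pays the write price
  // for the prefix and the rest pay the read price.
  const cachedPrice =
    (writeMultiplier + hitsPerWrite * readMultiplier) / (hitsPerWrite + 1);
  return (1 - cachedShare) + cachedShare * cachedPrice;
}

const neverReused = effectiveInputFraction(0.9, 0);   // every request writes the cache
const reusedNineTimes = effectiveInputFraction(0.9, 9);
const heavilyReused = effectiveInputFraction(0.9, 999);
```

Under these assumptions a prefix that is never reused costs more than no caching at all, nine reads per write bring input cost to roughly 0.29 of the uncached figure, and heavy reuse approaches 0.19. The hit rate therefore matters as much as the cached share.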
Sample Budget
Don't eval each config on 10,000 items. Start small:
- Shortlist with 50 items → narrow to 3 configs
- Confirm with 200 items → pick winner
- Validate in production with canary on 5% traffic → full rollout
Saves 10-100× on eval cost.
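The saving is easy to sanity-check by counting eval calls. This sketch assumes the seven starting configs listed under Common Configs and compares the staged plan with running every config on 10,000 items; the `Stage` shape is hypothetical.

```typescript
// One eval stage: how many configs are run on how many items.
interface Stage {
  configs: number;
  items: number;
}

// Total number of model calls across all stages.
function evalCalls(stages: Stage[]): number {
  return stages.reduce((total, s) => total + s.configs * s.items, 0);
}

// Staged plan: shortlist 7 configs on 50 items, then confirm 3 on 200 items.
const stagedCalls = evalCalls([
  { configs: 7, items: 50 },
  { configs: 3, items: 200 },
]);

// Exhaustive plan: all 7 configs on 10,000 items.
const exhaustiveCalls = evalCalls([{ configs: 7, items: 10000 }]);

const savingFactor = exhaustiveCalls / stagedCalls;
```

That is 950 calls against 70,000, a saving of roughly 74×. Call counts are only a proxy for spend, since per-call cost differs by model.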
Anti-Patterns
- Choosing the biggest model by default — often dominated
- Ignoring cache costs — skews the whole picture
- One-dimensional optimization — quality-only misses cost blowouts
- Tiered routing without measuring — classification errors can negate gains
- Per-product model choices — you can't reuse infrastructure investment
- Skipping prompt levers — jumping to bigger model before trying few-shot
Best Practices
- Measure cost AND quality AND latency; plot the Pareto frontier
- Try prompt levers (few-shot, CoT, structure) before upsizing the model
- Compute break-even: is the quality gain worth the cost delta?
- Consider tiered routing; measure that it actually helps
- Always factor in cache reads/writes, which can swing effective input cost several-fold
- Eval with small samples first (50-200); canary in prod before full rollout
- Revisit choices quarterly — pricing and model quality move