name: eval-dataset-design
description: |
  Design eval datasets that actually measure model quality: coverage, difficulty distribution, labeling consistency, and avoiding contamination. Covers sourcing, stratification, label quality, and when to generate vs curate. Use this skill when building a new eval set, realizing your current evals don't catch regressions, or labeling is inconsistent. Activate when: eval dataset, benchmark, test set, eval coverage, label quality, synthetic eval, dataset design.
Eval Dataset Design
Your evals are only as good as the dataset they run on. Miss a user scenario and you'll never catch regressions on it.
When to Use
- Starting an eval program from zero
- Your evals pass but users still hit issues → coverage gap
- Labels are inconsistent across reviewers → quality problem
- Adding evals for a new feature or domain
Dataset Properties Worth Optimizing
- Coverage — representative of real user queries
- Difficulty distribution — mix of easy/medium/hard, not all easy
- Label consistency — two humans agree on the label
- Stability — same inputs → same evaluable outputs over time
- Uncontaminated — not in the model's training data
Sourcing Inputs
Best to worst:
- Real user queries (anonymized) — highest signal
- Synthetic queries generated from real templates — fills gaps
- Adversarial queries hand-crafted for known failure modes
- Existing benchmarks — context, but often contaminated and dated
A good eval set mixes all four. As a starting point, try roughly 60% real, 20% synthetic, 15% adversarial, and 5% benchmark, then adjust to your own traffic.
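Once you pick a target size, the split turns into per-source item counts. A minimal sketch, assuming the 60/20/15/5 percentages from the text; the `source_quotas` helper and the largest-remainder rounding are illustrative choices, not part of the original guidance.

```python
# Percentages from the text; replace with your own mix.
SOURCE_MIX_PCT = {"real": 60, "synthetic": 20, "adversarial": 15, "benchmark": 5}


def source_quotas(total, mix_pct=SOURCE_MIX_PCT):
    """Turn percentage shares into integer item counts that sum to `total`."""
    if sum(mix_pct.values()) != 100:
        raise ValueError("percentages must sum to 100")
    quotas, remainders = {}, {}
    for name, pct in mix_pct.items():
        # Integer arithmetic avoids floating-point rounding surprises.
        quotas[name], remainders[name] = divmod(total * pct, 100)
    # Hand any leftover items to the sources with the largest remainders.
    leftover = total - sum(quotas.values())
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[name] += 1
    return quotas


quotas_for_200 = source_quotas(200)
```

For a 200-item regression set this yields 120 real, 40 synthetic, 30 adversarial, and 10 benchmark items.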
Stratification
Split your dataset by categories that matter:
dataset:
  categories:
    # values are sample counts
    simple_qa: 100            # easy, high-frequency
    multi_step_reasoning: 50  # medium
    ambiguous_queries: 30     # hard
    edge_cases: 20            # adversarial
    rare_domains: 20          # coverage of long tail
Report metrics per stratum, not just the aggregate. A model can improve on average while regressing on edge cases — you'll only see it stratified.
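A minimal sketch of per-stratum reporting, assuming each result is a `(category, passed)` pair. The two runs are made-up numbers chosen to show an aggregate gain hiding an edge-case regression; the function names are hypothetical.

```python
from collections import defaultdict


def pass_rate_by_stratum(results):
    """results: iterable of (category, passed) pairs -> {category: pass rate}."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, seen]
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {category: p / n for category, (p, n) in totals.items()}


def aggregate_pass_rate(results):
    """Single pass rate over all items, ignoring categories."""
    results = list(results)
    return sum(int(passed) for _, passed in results) / len(results)


# Illustrative data only: 100 simple_qa items and 20 edge_cases items per run.
old_run = ([("simple_qa", True)] * 90 + [("simple_qa", False)] * 10
           + [("edge_cases", True)] * 14 + [("edge_cases", False)] * 6)
new_run = ([("simple_qa", True)] * 97 + [("simple_qa", False)] * 3
           + [("edge_cases", True)] * 10 + [("edge_cases", False)] * 10)

for name, run in [("old", old_run), ("new", new_run)]:
    print(name, round(aggregate_pass_rate(run), 3), pass_rate_by_stratum(run))
```

In this made-up example the aggregate rises (104/120 to 107/120) while `edge_cases` drops from 70% to 50%, which only the per-stratum view exposes.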
Labeling Quality
Have two people label the same 50 items independently, then compute inter-annotator agreement:
from sklearn.metrics import cohen_kappa_score

# labeler_a, labeler_b: labels for the same items, in the same order
kappa = cohen_kappa_score(labeler_a, labeler_b)
Target:
- κ > 0.8: excellent, labels are reliable
- κ 0.6-0.8: good, some ambiguity
- κ < 0.6: rewrite your labeling rubric — humans can't agree, so neither can models
Resolve disagreements with a tiebreaker, then update the rubric based on what caused disagreement.
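To see what the kappa score measures, here is a dependency-free sketch of unweighted Cohen's kappa, plus a helper that lists the items to send to the tiebreaker. The function names and the ten example labels are hypothetical. Returning 1.0 when chance agreement is already 1 is a convention chosen for this sketch.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa: (observed - expected) / (1 - expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both labelers pick the same label at random.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used one identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(item_ids, labels_a, labels_b):
    """Item ids where the two labelers differ, for tiebreaking and rubric updates."""
    return [item for item, a, b in zip(item_ids, labels_a, labels_b) if a != b]


# Illustrative labels for ten items.
labeler_a = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes"]
labeler_b = ["yes", "yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes"]
kappa = cohen_kappa(labeler_a, labeler_b)
disputed = disagreements(range(10), labeler_a, labeler_b)
```

Here the labelers agree on 8 of 10 items, chance agreement is 0.52, and κ comes out near 0.58. That is below the 0.6 threshold even though raw agreement is 80%, which is why raw agreement alone is misleading.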
Labeling Rubric
Write explicit guidelines with examples:
### Label: helpful
**Definition**: Response addresses the user's question directly and accurately.
**Examples**:
- Query: "How do I loop in Python?" / Response: Shows `for` loop → YES
- Query: "How do I loop in Python?" / Response: General loop theory → NO (dodges the specific language)
- Query: "Fix this bug" / Response: Points out the bug + fix → YES
- Query: "Fix this bug" / Response: "I'll need more info" (bug is in the code) → NO
If two labelers disagree, add their disputed case as a rubric example.
Contamination
If your eval is in the model's training data, scores are inflated. Check:
- Search for each query (exact text or a hash of it) in public datasets, GitHub, and the web, which are the common leak sources
- Ask the model to complete the eval query's preamble; if it auto-completes with the expected answer, it has probably memorized the item
- Regenerate with paraphrasing: rewrite queries so training data near-matches become mismatches
For production evals, rotate the dataset yearly and keep a private held-out set.
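A minimal sketch of the first check, assuming you have candidate documents to compare against (a downloaded public dataset, scraped pages). The normalization rules, the 5-gram window, and all function names are assumptions for illustration. An exact hash only catches verbatim copies, so the n-gram overlap gives a softer signal.

```python
import hashlib
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())


def fingerprint(text):
    """SHA-256 of the normalized text; equal fingerprints mean a verbatim match."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def ngrams(text, n=5):
    """Set of word n-grams in the normalized text."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap(query, document, n=5):
    """Fraction of the query's n-grams that also appear in the document."""
    query_grams = ngrams(query, n)
    if not query_grams:
        return 0.0  # query shorter than n words
    return len(query_grams & ngrams(document, n)) / len(query_grams)


# Illustrative strings only.
query = "How do I reset my password if I lost my phone?"
leaked_page = "FAQ: how do I reset my password if I lost my phone? Contact support."
paraphrase = "What should I do to regain account access after losing my phone?"
leak_score = overlap(query, leaked_page)
paraphrase_score = overlap(query, paraphrase)
```

A high overlap flags an item for paraphrasing or removal. The paraphrased version shares no 5-gram with the original, which is the effect the third check relies on.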
Difficulty Calibration
Track difficulty via model pass rate:
- < 30% pass: too hard; models improve but you can't measure it
- 30-80% pass: useful range
- > 95% pass: too easy; dataset has plateaued
Prune items that reach 100% for several consecutive model generations — they no longer discriminate.
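A minimal sketch of the banding and pruning rules. The text names no band for 80-95%, so the sketch reports that range as "unlabeled" rather than guessing. Reading "several" as three generations is an assumption, as are the function names and the history format.

```python
def difficulty_band(pass_rate):
    """Map a pass rate in [0, 1] to the bands named in the text."""
    if pass_rate < 0.30:
        return "too hard"
    if pass_rate <= 0.80:
        return "useful"
    if pass_rate > 0.95:
        return "too easy"
    return "unlabeled"  # 80-95%: the text does not name this range


def items_to_prune(history, generations=3):
    """history: {item_id: [passed, ...]} ordered oldest to newest model generation.

    Returns items that passed in each of the last `generations` generations.
    """
    return [
        item_id
        for item_id, passes in history.items()
        if len(passes) >= generations and all(passes[-generations:])
    ]


# Illustrative history: one boolean per model generation.
history = {
    "q-001": [True, True, True, True],
    "q-002": [False, True, True],
    "q-003": [True, True],  # not enough generations yet
}
prune_candidates = items_to_prune(history)
```

Only `q-001` qualifies here. Before deleting pruned items, consider archiving them so older scores can still be reproduced.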
Synthetic Generation
When you need more coverage:
import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment.
const client = new Anthropic();

const prompt = `Generate 20 diverse user queries that a customer support bot might receive.
Cover: billing (5), technical issues (5), account access (5), general FAQ (5).
Vary wording: formal, casual, angry, confused.
Return a JSON array of strings.`;

const response = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 4000,
  messages: [{ role: "user", content: prompt }],
});
Then:
- Human-review every synthetic query for realism
- Label them the same way as real queries
- Track whether synthetic vs real have different score profiles — a red flag if they diverge
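The last check can be automated. A minimal sketch, assuming each result record carries a `source` field (`"real"` or `"synthetic"`) and a `passed` boolean. The 10-point gap threshold, the 30-item minimum, and the `source_gap` name are illustrative assumptions, not recommendations from the text.

```python
def source_gap(records, max_gap=0.10, min_items=30):
    """Compare pass rates of real vs synthetic items.

    Returns None when either side has fewer than `min_items` items.
    """
    counts = {"real": [0, 0], "synthetic": [0, 0]}  # source -> [passed, seen]
    for record in records:
        bucket = counts[record["source"]]
        bucket[0] += int(record["passed"])
        bucket[1] += 1
    if min(seen for _, seen in counts.values()) < min_items:
        return None  # too few items on one side for a meaningful comparison
    real_rate = counts["real"][0] / counts["real"][1]
    synthetic_rate = counts["synthetic"][0] / counts["synthetic"][1]
    gap = synthetic_rate - real_rate
    return {
        "real": real_rate,
        "synthetic": synthetic_rate,
        "gap": gap,
        "diverges": abs(gap) > max_gap,
    }


# Illustrative results: 40 real items (28 pass), 40 synthetic items (36 pass).
records = (
    [{"source": "real", "passed": i < 28} for i in range(40)]
    + [{"source": "synthetic", "passed": i < 36} for i in range(40)]
)
report = source_gap(records)
```

A large positive gap suggests the synthetic queries are easier than real traffic; a large negative one suggests they are unrealistic in some other way. Either is a reason to re-review them.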
Size
- Smoke test: 20-50 items, run in CI
- Regression set: 200-500 items, run weekly
- Full eval: 1000-5000 items, run per major release
- Beyond that: sampling with stratification, not more volume
Quality > quantity. 200 well-labeled items beat 5000 noisy ones.
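For the "sampling with stratification" case, a minimal sketch that draws a fixed number of items per category from a larger pool. The fixed seed makes the sample reproducible across runs. The `stratified_sample` helper and the item shape are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, per_stratum, seed=0):
    """Draw up to `per_stratum` items from each group defined by `key(item)`."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    sample = []
    for name in sorted(groups):  # sorted for a deterministic draw order
        members = groups[name]
        # Small strata are taken whole instead of raising an error.
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Illustrative pool: a large common category and a small rare one.
pool = (
    [{"id": f"s{i}", "category": "simple_qa"} for i in range(100)]
    + [{"id": f"r{i}", "category": "rare_domains"} for i in range(5)]
)
subset = stratified_sample(pool, key=lambda item: item["category"], per_stratum=10)
```

A uniform random sample of 15 items from this pool would usually contain zero or one `rare_domains` item; the stratified draw keeps all five.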
Versioning
Treat datasets like code:
evals/
  customer_support/
    v1/
      dataset.jsonl
      rubric.md
      CHANGELOG.md
    v2/
      dataset.jsonl
      rubric.md
      CHANGELOG.md
Never silently edit. Version bumps communicate "scores before v2 are not comparable to scores after".
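One way to catch silent edits is to record a content digest per version and verify it before each eval run. A minimal sketch, assuming records are JSON-serializable dicts loaded from `dataset.jsonl`. Where you store the digest (for example, in the CHANGELOG) and the function names are assumptions.

```python
import hashlib
import json


def dataset_digest(records):
    """SHA-256 over canonical JSON lines; key order within a record is ignored."""
    digest = hashlib.sha256()
    for record in records:
        line = json.dumps(record, sort_keys=True, ensure_ascii=False)
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def is_unchanged(records, recorded_digest):
    """True when the dataset still matches the digest recorded for its version."""
    return dataset_digest(records) == recorded_digest


# Illustrative records.
v1 = [{"id": 1, "query": "Fix this bug", "label": "helpful"}]
v1_reordered_keys = [{"label": "helpful", "query": "Fix this bug", "id": 1}]
v1_silently_edited = [{"id": 1, "query": "Fix this bug", "label": "not_helpful"}]
recorded = dataset_digest(v1)
```

Record order still matters in this sketch, so reshuffling the file changes the digest. Sort records by id before hashing if you want to ignore order.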
Private Held-Out Set
Keep 100-200 items that are never published and never used for prompt iteration. Use them only for:
- Measuring generalization on unseen examples
- Catching overfitting to your public eval
Rotate a fraction yearly.
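A minimal sketch of the rotation step, assuming you maintain a pool of fresh, never-used items to rotate in. The 25% fraction, the seed, and the `rotate_held_out` name are illustrative assumptions; the text only says "a fraction".

```python
import random


def rotate_held_out(held_out, fresh_pool, fraction=0.25, seed=0):
    """Swap a random fraction of the held-out set for fresh items.

    Returns (new_held_out, retired_items). The set size stays constant.
    """
    rng = random.Random(seed)
    n_swap = int(len(held_out) * fraction)
    if n_swap > len(fresh_pool):
        raise ValueError("not enough fresh items to rotate in")
    retire_idx = set(rng.sample(range(len(held_out)), n_swap))
    kept = [item for i, item in enumerate(held_out) if i not in retire_idx]
    retired = [item for i, item in enumerate(held_out) if i in retire_idx]
    incoming = rng.sample(fresh_pool, n_swap)
    return kept + incoming, retired


# Illustrative ids only.
held_out = [f"h{i}" for i in range(100)]
fresh_pool = [f"f{i}" for i in range(50)]
new_held_out, retired = rotate_held_out(held_out, fresh_pool)
```

Scores on the rotated set are not directly comparable to earlier ones, so treat a rotation as a version bump like any other dataset change.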
Anti-Patterns
- Evals that only cover easy cases — model passes your eval, fails in prod
- Single labeler — no way to know if labels are noisy
- No versioning — silent edits invalidate historical trends
- Overfit to eval — you tuned prompts against the eval set; now it doesn't generalize
- All real or all synthetic — synthetic misses distribution, real misses edge cases
Best Practices
- Mix real + synthetic + adversarial; stratify and report per category
- Label with a written rubric; measure inter-annotator kappa
- Check for training-data contamination; paraphrase suspicious queries
- Track item-level difficulty; prune items that hit 100% pass
- Version datasets like code; publish CHANGELOGs
- Keep a private held-out set for generalization checks
- Quality > quantity; 200 good labels beat 5000 rushed ones