eval-dataset-design - SKILL.md Agent Skill

name: eval-dataset-design description: | Design eval datasets that actually measure model quality — coverage, difficulty distribution, labeling consistency, and avoiding contamination. Covers sourcing, stratification, label quality, and when to generate vs curate. Use this skill when building a new eval set, realizing your current evals don't catch regressions, or labeling is inconsistent. Activate when: eval dataset, benchmark, test set, eval coverage, label quality, synthetic eval, dataset design.

Eval Dataset Design

Your evals are only as good as the dataset they run on. Miss a user scenario and you'll never catch regressions on it.

When to Use

Starting an eval program from zero
Your evals pass but users still hit issues → coverage gap
Labels are inconsistent across reviewers → quality problem
Adding evals for a new feature or domain

Dataset Properties Worth Optimizing

Coverage — representative of real user queries
Difficulty distribution — mix of easy/medium/hard, not all easy
Label consistency — two humans agree on the label
Stability — same inputs → same evaluable outputs over time
Uncontaminated — not in the model's training data

Sourcing Inputs

Best to worst:

Real user queries (anonymized) — highest signal
Synthetic queries generated from real templates — fills gaps
Adversarial queries hand-crafted for known failure modes
Existing benchmarks — context, but often contaminated and dated

A good eval set mixes all four. Typical split: 60% real, 20% synthetic, 15% adversarial, 5% benchmark.

Stratification

Split your dataset by categories that matter:

dataset:
  categories:
    simple_qa: 100 samples        # easy, high-frequency
    multi_step_reasoning: 50       # medium
    ambiguous_queries: 30          # hard
    edge_cases: 20                 # adversarial
    rare_domains: 20               # coverage of long tail

Report metrics per stratum, not just the aggregate. A model can improve on average while regressing on edge cases — you'll only see it stratified.

Labeling Quality

Two people label the same 50 items independently. Compute inter-annotator agreement:

from sklearn.metrics import cohen_kappa_score
kappa = cohen_kappa_score(labeler_a, labeler_b)

Target:

κ > 0.8: excellent, labels are reliable
κ 0.6-0.8: good, some ambiguity
κ < 0.6: rewrite your labeling rubric — humans can't agree, so neither can models

Resolve disagreements with a tiebreaker, then update the rubric based on what caused disagreement.

Labeling Rubric

Write explicit guidelines with examples:

### Label: helpful

**Definition**: Response addresses the user's question directly and accurately.

**Examples**:
- Query: "How do I loop in Python?" / Response: Shows `for` loop → YES
- Query: "How do I loop in Python?" / Response: General loop theory → NO (dodges the specific language)
- Query: "Fix this bug" / Response: Points out the bug + fix → YES
- Query: "Fix this bug" / Response: "I'll need more info" (bug is in the code) → NO

If two labelers disagree, add their disputed case as a rubric example.

Contamination

If your eval is in the model's training data, scores are inflated. Check:

Hash the query and search public datasets / GitHub / web — common sources
Ask the model to complete the eval query's preamble — if it auto-completes with the expected answer, it's memorized
Regenerate with paraphrasing — rewrite queries so training data near-matches become mismatches

For production evals, rotate the dataset yearly and keep a private held-out set.

Difficulty Calibration

Track difficulty via model pass rate:

< 30% pass: too hard; models improve but you can't measure it
30-80% pass: useful range
> 95% pass: too easy; dataset has plateaued

Prune items that reach 100% for several consecutive model generations — they no longer discriminate.

Synthetic Generation

When you need more coverage:

const prompt = `Generate 20 diverse user queries that a customer support bot might receive.
Cover: billing (5), technical issues (5), account access (5), general FAQ (5).
Vary wording: formal, casual, angry, confused.
Return JSON array.`;

const response = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 4000,
  messages: [{ role: "user", content: prompt }],
});

Then:

Human-review every synthetic query for realism
Label them the same way as real queries
Track whether synthetic vs real have different score profiles — a red flag if they diverge

Size

Smoke test: 20-50 items, run in CI
Regression set: 200-500 items, run weekly
Full eval: 1000-5000 items, run per major release
Beyond that: sampling with stratification, not more volume

Quality > quantity. 200 well-labeled items beat 5000 noisy ones.

Versioning

Treat datasets like code:

evals/
  customer_support/
    v1/
      dataset.jsonl
      rubric.md
      CHANGELOG.md
    v2/
      dataset.jsonl
      rubric.md
      CHANGELOG.md

Never silently edit. Version bumps communicate "scores before v2 are not comparable to scores after".

Private Held-Out Set

Keep 100-200 items never published, never used for prompt iteration. Only for:

Measuring generalization on unseen examples
Catching overfitting to your public eval

Rotate a fraction yearly.

Anti-Patterns

Evals that only cover easy cases — model passes your eval, fails in prod
Single labeler — no way to know if labels are noisy
No versioning — silent edits invalidate historical trends
Overfit to eval — you tuned prompts against the eval set; now it doesn't generalize
All real or all synthetic — synthetic misses distribution, real misses edge cases

Best Practices

Mix real + synthetic + adversarial; stratify and report per category
Label with a written rubric; measure inter-annotator kappa
Check for training-data contamination; paraphrase suspicious queries
Track item-level difficulty; prune items that hit 100% pass
Version datasets like code; publish CHANGELOGs
Keep a private held-out set for generalization checks
Quality > quantity; 200 good labels beat 5000 rushed ones