llm-evaluation - SKILL.md Agent Skill

name: llm-evaluation
description: >-
  Action-oriented advisor for evaluating LLMs: choosing and building benchmarks, perplexity, knowledge/reasoning (MMLU/GPQA/HLE), chat/preference eval (Chatbot Arena Elo, AlpacaEval, LLM-as-judge), agentic benchmarks (SWE-bench), safety/jailbreak eval, ecological validity, and decontamination. Use when deciding how to measure a model, picking the right benchmark for a goal, building an eval harness, or debugging contaminated/gamed scores.
metadata:
  source: Stanford CS336 (Spring 2026), Lecture 12 (Evaluation)
  promptSignals:
    phrases:
      - "evaluate the model"
      - "which benchmark"
      - "build an eval"
      - "perplexity"
      - "MMLU"
      - "LLM as judge"
      - "chatbot arena"
      - "SWE-bench"
      - "benchmark contamination"
      - "decontamination"
    minScore: 4

llm-evaluation — measuring LLMs well

You help engineers choose, build, and interpret evaluations. Core principle: evaluation turns an abstract construct into a concrete prompt-and-score procedure, and what you measure is what you optimize — so design the metric to match the real goal.

Mental models (hold these first)

Construct → procedure gap. Every benchmark operationalizes a fuzzy construct ("reasoning," "helpfulness"). Always ask what construct you actually care about, then pick the closest valid procedure.
Four families, in order of increasing realism and cost: perplexity → multiple-choice knowledge/reasoning → open-ended/chat preference → agentic tasks. As realism rises, scoring gets harder to automate.
Contamination is the default failure. If test data leaked into training, the score measures memorization rather than capability. Assume contamination until shown otherwise.
Judges have biases. Human and LLM judges skew toward length, style, and sycophancy.

Procedure — pick the eval for your goal

State the decision the eval informs (ship? pick model? measure an algorithm?).
Match family:
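- Track training / scaling → perplexity (exponentiated average negative log-likelihood per token on held-out data; the underlying cross-entropy is bounded below by the true entropy of the data; the same token probabilities underlie cloze tasks like LAMBADA). On perplexity leaderboards, verify that each model defines a valid probability distribution (non-negative, summing to 1), since an invalid one can report an arbitrarily low perplexity.
- Knowledge & reasoning → MMLU (now saturated), GPQA (Google-proof, PhD-level; experts score ≈65%), Humanity's Last Exam (HLE). Prefer benchmarks with held-out private sets.
- Assistant quality → Chatbot Arena (pairwise human preference → Elo rating), AlpacaEval (win rate vs. a baseline model). Expect judge bias.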
- Real capability → agentic: SWE-bench (real GitHub issues), cyber CTF, long-horizon tasks. Score depends on scaffold/delegation/context-management, not just the base model.
- Frontier reasoning → ARC-AGI-style abstract reasoning.
Add safety eval (jailbreaks, contextual harm). Note that safety scores often correlate with general capability, so check that an apparent safety gain is not just a capability gain.
Check ecological validity — does it resemble real query streams (e.g. medical/economic query-level benchmarks)?
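For the perplexity family above, a minimal sketch of the scoring rule may help. It assumes you already have per-token natural-log probabilities from the model; the function names `perplexity` and `is_valid_distribution` are hypothetical helpers written for illustration, not part of any library.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    Assumes natural-log probabilities, one per held-out token.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def is_valid_distribution(probs, tol=1e-6):
    """Check that a next-token distribution is non-negative and sums to 1."""
    return all(p >= 0 for p in probs) and abs(sum(probs) - 1.0) <= tol


# A model that assigns probability 0.25 to every token has perplexity of about 4.
uniform_logprobs = [math.log(0.25)] * 8
print(perplexity(uniform_logprobs))

# Probabilities that sum to more than 1 would make perplexity look better than it is.
print(is_valid_distribution([0.6, 0.6]))
```

Comparing perplexities is only meaningful when both models are scored on the same text under the same tokenization convention, which is an assumption this sketch leaves to the caller.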

Procedure — make results trustworthy

Decontaminate: check train/test overlap; use model cutoff dates; for code, prefer private/internal repos not on the web.
Read the outputs. Always inspect raw generations — agentic benchmarks can be gamed (e.g. an empty response scoring ~38%).
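Harden judges: use rubrics or checklists, ensemble multiple judges, and fix the judge prompt. An LLM judge is not automatically better than a human one; its verdicts are useful only to the extent that they mimic the preferences of the target population.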
Report honestly: state the construct, the scoring rule, and known contamination/judge caveats.
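The train/test overlap check in the decontamination step can be sketched as a word-level n-gram comparison. This is an illustrative sketch, not a standard tool: the helper names, the whitespace tokenization, the default `n=8`, and the `threshold=0.5` cutoff are all assumptions to tune for your data.

```python
def ngrams(text, n):
    """Set of word-level n-grams, using lowercase whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_train_index(train_docs, n):
    """Union of all n-grams seen in the training documents."""
    index = set()
    for doc in train_docs:
        index |= ngrams(doc, n)
    return index


def overlap_fraction(example, train_index, n):
    """Fraction of the example's n-grams that also occur in training data."""
    grams = ngrams(example, n)
    if not grams:
        return 0.0  # example shorter than n tokens: nothing to compare
    return len(grams & train_index) / len(grams)


def flag_contaminated(test_set, train_docs, n=8, threshold=0.5):
    """Return test examples whose n-gram overlap meets the threshold."""
    index = build_train_index(train_docs, n)
    return [ex for ex in test_set if overlap_fraction(ex, index, n) >= threshold]


train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
tests = [
    "the quick brown fox jumps over the lazy dog",
    "an entirely different sentence about evaluation harness design choices",
]
print(flag_contaminated(tests, train, n=4))
```

Exact n-gram matching misses paraphrased or translated leaks, so treat a clean result as weak evidence and keep the cutoff-date and private-repo checks alongside it.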

Cheatsheet

Goal	Use	Watch
Scaling/training signal	perplexity	invalid distributions
Knowledge/reasoning	GPQA / HLE (private split)	MMLU saturation, contamination
Assistant preference	Arena Elo / AlpacaEval	length & sycophancy bias
Real engineering ability	SWE-bench / agentic	scaffold-dependence, gaming
Algorithm (not model)	nanoGPT speedrun-style	apples-to-apples setup
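To make the Arena-style rating in the table concrete, here is a minimal sketch of the classic online Elo update applied to pairwise battles. The `k=32` step size, the starting rating of 1000, and the function names are assumptions chosen for illustration; a production leaderboard may fit ratings differently (for example with a Bradley-Terry model).

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings, battles, k=32.0, initial=1000.0):
    """Apply Elo updates for a sequence of (model_a, model_b, outcome) battles.

    outcome is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    Returns a new dict and leaves the input ratings untouched.
    """
    ratings = dict(ratings)
    for a, b, outcome in battles:
        r_a = ratings.get(a, initial)
        r_b = ratings.get(b, initial)
        e_a = expected_score(r_a, r_b)
        # Zero-sum update: whatever A gains, B loses.
        ratings[a] = r_a + k * (outcome - e_a)
        ratings[b] = r_b - k * (outcome - e_a)
    return ratings


battles = [("model-x", "model-y", 1.0), ("model-y", "model-z", 0.5)]
print(update_elo({}, battles))
```

The online update depends on battle order, so shuffle or bootstrap the battles before trusting small rating gaps.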

Pitfalls

Optimizing a saturated or contaminated benchmark.
Trusting an aggregate score without reading transcripts.
Using multiple-choice as a proxy for real-world open-ended usage.
Treating an LLM judge as ground truth.
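One way to avoid treating an LLM judge as ground truth is to measure how well it agrees with human labels on a small audited sample. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance); the function names and the toy labels are hypothetical and stand in for your own judge and human verdicts.

```python
from collections import Counter


def agreement_rate(judge, human):
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge, human):
    """Chance-corrected agreement between two raters."""
    n = len(judge)
    p_observed = agreement_rate(judge, human)
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    labels = set(judge_counts) | set(human_counts)
    # Agreement expected if both raters labeled independently at their own rates.
    p_expected = sum(
        (judge_counts[label] / n) * (human_counts[label] / n) for label in labels
    )
    if p_expected == 1.0:
        # Degenerate case: both raters used one identical label throughout.
        # Kappa is undefined here; returning 1.0 is a convention of this sketch.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


judge_labels = ["A", "A", "B", "B"]
human_labels = ["A", "B", "B", "B"]
print(agreement_rate(judge_labels, human_labels))
print(cohens_kappa(judge_labels, human_labels))
```

A high raw agreement with a low kappa usually means one label dominates, so report both numbers together with the sample size.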

Derived from Stanford CS336 Spring 2026, Lecture 12 (Evaluation). Transcript: yt2md/docs/transcripts/…lecture-12-evaluation*; index in ~/Desktop/youtube/.