name: llm-eval-grounded-theory
description: Runs the qualitative-to-quantitative LLM evaluation pipeline — open coding, axial coding, synthetic strata coverage, analytic rubric design, double-labeled gold sets, LLM-as-judge calibration, and continuous monitoring integration. Use when building eval datasets, failure taxonomies, rubrics, golden sets, judge calibration, shadow mode, or production eval regression loops for agentic/LLM systems.
disable-model-invocation: true
paths:
  - docs/recipes/goaljudge/**
  - docs/research/goaljudge*
  - docs/plans/goaljudge*
  - services/governance/goaljudge*
  - components/goal_judge.py
LLM Eval Grounded-Theory Pipeline
Handbook for building a trustworthy LLM-as-judge from real traces: qualitative error analysis → taxonomy → rubric → gold set → calibration → production monitoring.
Deep reference: reference.md — IAA tables, enable-policy template, bias catalog, bibliography.
Worked example (this repo): examples-goaljudge.md
Docs mirror: docs/skills/llm-eval-grounded-theory/ · implementation plan
When to use
Apply this skill when:
- Starting eval work on a new agent feature or product surface
- Designing a failure taxonomy, rubric, or golden dataset
- Calibrating an LLM judge before it gates outcomes
- Integrating offline regression + online monitoring
Do not skip error analysis before writing judge prompts. Practitioner consensus (R1–R3): error analysis is the most important step — more important than the judge itself.
Cardinal rules
- Trace is ground truth; narration is a suspect claim. Ground verdicts in observable tool outputs, state, and termination — not agent prose (R10).
- LLM proposes, human disposes. Never delegate first-pass open coding to an LLM (R3).
- Three orthogonal axes: agent behavior / environment confound / judge reliability. Never fold confounds into agent-failure counts.
- Criteria drift is structural. Budget re-coding loops; freeze the test split each cycle (R4).
- Class-specific metrics, not accuracy. Gate on precision/recall of the action-triggering class (R15, R19).
- Default-off until calibrated. Shadow/telemetry first; action gates only after enable-policy clears.
Pipeline overview
flowchart TD
S0["Stage 0: Trace collection"]
S1["Stage 1: Open coding"]
S2["Stage 2: Axial coding + taxonomy"]
S3["Stage 3: Synthetic data"]
S4["Stage 4: Rubric design"]
S5["Stage 5: Gold set + IAA"]
S6["Stage 6: Judge calibration"]
S7["Stage 7: Continuous monitoring"]
GATE["Enable-policy gates"]
S0 --> S1 --> S2 --> S3 --> S4 --> S5 --> S6 --> GATE
GATE -->|pass| S7
GATE -->|fail| SHADOW["Shadow / telemetry only"]
S7 --> LOOP["Steady-state refresh"]
LOOP -.-> S1
LOOP -.-> S5
S2 -. criteria drift .-> S1
S6 -. rubric ambiguity .-> S4
Stage 0: Trace collection
Goal: Collect annotatable trajectories before any coding.
Checklist
- Full trajectories: tool inputs + outputs + intermediate state + final answer (not final answer alone)
- Environment posture verified before coding (garbage-in guard — blocked tools, missing paths invalidate failure counts)
- Cheap trace viewer or export for domain experts (R3)
- Stable IDs on every run (`trace_id`, `user_id`, `task_id`) for join integrity
- Separate coding sample (~100 representative) from gold-set sample (stratified later)
Artifacts
Flat table: trace_id, task_input, final_answer, evidence digest, empty open_code_note.
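A minimal sketch of the flat coding table, assuming a CSV export is acceptable to annotators. The `TraceRow` class and `write_coding_table` helper are hypothetical names for illustration; only the column names come from the artifact description above.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class TraceRow:
    trace_id: str
    task_input: str
    final_answer: str
    evidence_digest: str
    open_code_note: str = ""  # stays empty until Stage 1 open coding


def write_coding_table(rows):
    """Serialize rows to CSV text, rejecting duplicate trace IDs."""
    seen = set()
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(TraceRow)])
    writer.writeheader()
    for row in rows:
        if row.trace_id in seen:
            # Duplicate IDs would break joins against later label tables.
            raise ValueError(f"duplicate trace_id: {row.trace_id}")
        seen.add(row.trace_id)
        writer.writerow(asdict(row))
    return buf.getvalue()
```

Rejecting duplicates at export time is one cheap way to protect join integrity before any annotation effort is spent.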
Stage 1: Open coding
Goal: Inductive notes on traces before imposing categories.
Checklist
- Domain expert reads ≥100 traces end-to-end
- Brief open-ended notes per trace (Langfuse `TEXT` scores work well — R13)
- First-failure discipline: code only the first deviation as primary (R2)
- Resist priors; seed codes are bootstrap only — let categories emerge (R12)
- Track saturation: stop when ~20 consecutive traces add no new code type (R2)
- Initial pass is human-only — no LLM first pass (R3)
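The saturation rule in the checklist can be tracked mechanically once human codes are recorded. A sketch, assuming each trace's codes are stored as a set in reading order; `saturation_index` is a hypothetical helper, and the window of 20 mirrors the "~20 consecutive traces" guideline.

```python
def saturation_index(codes_per_trace, window=20):
    """Return the 1-based trace count at which saturation is reached, or None.

    codes_per_trace: list of sets of human-assigned open codes, in reading order.
    Saturation means `window` consecutive traces introduced no unseen code.
    """
    seen = set()
    quiet_run = 0
    for i, codes in enumerate(codes_per_trace, start=1):
        new_codes = set(codes) - seen
        seen |= new_codes
        # Any new code resets the run; otherwise extend it by one trace.
        quiet_run = 0 if new_codes else quiet_run + 1
        if quiet_run >= window:
            return i
    return None
```

This only measures when new code types stop appearing; it says nothing about whether the codes themselves are good, which stays a human judgment.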
Optional assist (Stages 1–2)
- LLM-assisted codebook generation (R21) — human validates all codes
- Open-code diagnostics without ground truth: coverage, novelty, coherence (R22)
Artifact template
| code | definition | first_seen_case | provenance |
|---|---|---|---|
| | | | production / synthetic |
Stage 2: Axial coding + failure taxonomy
Goal: Cluster open codes into testable failure categories; pick one mode for first rubric.
Workflow
- Cluster agent-behavior codes only into 5–6 named, testable categories
- Export notes → LLM proposes clusters → human renames, rejects untestable buckets (R3, R12)
- Split non-behavior codes: confound (environment) vs judge-reliability (verdict defect)
- Build per-case axial matrix with first-failure discipline
- Frequency tables on environment-corrected rows only
- IAA on category assignment (see reference.md)
- Pick one top failure mode (biggest + cleanest); gate before rubric work
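A sketch of the frequency table on environment-corrected rows, assuming the axial matrix is a list of dicts. The keys `axis` and `category` and the three axis labels are hypothetical; they stand in for whatever the axial matrix actually records.

```python
from collections import Counter


def corrected_frequency(axial_rows):
    """Rank primary failure categories using agent-behavior rows only.

    Rows tagged as confounds or judge-reliability defects are counted
    separately so they never inflate agent-failure totals.
    """
    behavior = [r for r in axial_rows if r["axis"] == "agent_behavior"]
    ranking = Counter(r["category"] for r in behavior).most_common()
    excluded = Counter(
        r["axis"] for r in axial_rows if r["axis"] != "agent_behavior"
    )
    return ranking, dict(excluded)
```

Returning the excluded counts alongside the ranking keeps the confound volume visible instead of silently dropping it.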
Gate before Stage 4
- Top mode selected with documented rationale
- IAA ≥ 0.80 on category assignment (or pre-declared threshold)
- Confounds documented and excluded from frequency ranking
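The IAA gate needs a chance-corrected agreement statistic. As an illustration only, here is Cohen's κ for two annotators; reference.md may prescribe a different statistic (for example Krippendorff's α), so treat this as a sketch rather than the required metric.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected by chance from each annotator's label distribution.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, annotators who agree on 3 of 4 items with a chance agreement of 0.5 score κ = 0.5, well below a 0.80 gate.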
Optional assist
Ensemble-LM axial coding (R23): clustering + intrinsic metrics alongside human alignment.
Stage 3: Synthetic data for scarce strata
Goal: Cover rare failure modes production traces won't supply.
Principles (R3, R16)
- Generate inputs, not outputs — run through real system
- Structured dimensions (domain × feasibility × target behavior × stratum), not "give me test queries"
- Hybrid: live runs for elicitable modes; deterministic fixtures for rare/dangerous strata (CoT-gaming, premature-impossible)
- Coverage verification: record mismatches as data; do not re-roll until target code appears
- Contamination firewall: synthetic → dev split only; never held-out test
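The structured-dimension and firewall principles can be sketched as a cross product of dimension values, with every cell tagged synthetic and pinned to the dev split. The function names, field names, and ID format below are hypothetical.

```python
from itertools import product


def build_prompt_matrix(dimensions):
    """Cross every dimension value into one input spec per cell."""
    names = list(dimensions)
    cells = []
    for i, values in enumerate(product(*(dimensions[n] for n in names))):
        cell = dict(zip(names, values))
        # Synthetic inputs are tagged at creation and assigned to dev only.
        cell.update(case_id=f"syn-{i:04d}", provenance="synthetic", split="dev")
        cells.append(cell)
    return cells


def assert_firewall(test_split):
    """Raise if any synthetic case leaked into the held-out test split."""
    leaked = [c["case_id"] for c in test_split if c.get("provenance") == "synthetic"]
    if leaked:
        raise ValueError(f"synthetic cases in test split: {leaked}")
```

Each cell describes an input to run through the real system; the observed code is recorded afterwards, whether or not it matches the targeted stratum.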
Artifact template
Dimension spec (D1–Dn), prompt matrix, provenance tags (live / synthetic), coverage map per stratum.
Seed codes: see reference.md — use as bootstrap only.
Stage 4: Rubric design
Goal: Analytic, evidence-grounded criteria encoded as judge prompt + output schema.
Principles
| Principle | Rationale |
|---|---|
| Analytic (criterion-by-criterion) | Prevents halo/conflation; per-criterion κ (R5) |
| Evidence-grounded | Observable span per criterion; anti-gaming (R10) |
| Binary pass/fail | Likert scores bunch in the 3.2–3.8 range, giving a weak calibration signal (R15) |
| Conservative binarization | Partial/impossible → fail for action-gating; metadata separate |
| Anchor examples | Concrete pass/fail exemplars lock boundaries (R6) |
| Co-construct with labels | Validate against gold-set labels, not in vacuum (R4) |
Ship posture
- PROVISIONAL rubric → shadow/telemetry immediately
- Confirmation gate → action stays off until Stage 6 clears enable-policy
- Split: Code gate (ship provisional prompt) vs Confirmation gate (scientific establishment)
Artifact template
Rubric spec: criterion ↔ taxonomy category ↔ testable check ↔ expected verdict fields.
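One possible shape for the verdict fields, with conservative binarization applied at the gate. The `CriterionVerdict` type and `binarize` helper are hypothetical, and treating a pass without an evidence span as a fail is an assumption drawn from the evidence-grounded principle, not a rule stated in the rubric spec.

```python
from dataclasses import dataclass


@dataclass
class CriterionVerdict:
    criterion_id: str
    outcome: str        # "pass", "fail", "partial", or "impossible"
    evidence_span: str  # quoted span from the trace; empty if none was found


def binarize(verdicts):
    """Collapse per-criterion verdicts into one gate value plus metadata."""
    gate_pass = bool(verdicts)  # no criteria evaluated counts as a fail
    metadata = []
    for v in verdicts:
        grounded_pass = v.outcome == "pass" and bool(v.evidence_span.strip())
        if not grounded_pass:
            gate_pass = False
        if v.outcome in ("partial", "impossible"):
            # Kept separately so graceful failures stay visible in reports.
            metadata.append((v.criterion_id, v.outcome))
    return {"gate": "pass" if gate_pass else "fail", "metadata": metadata}
```

Keeping partial and impossible outcomes in metadata lets the action gate stay binary without losing the nuance needed for later analysis.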
Stage 5: Gold set creation
Goal: Stratified, double-labeled trust anchor for calibration.
Contract
- Size: 200–300 items (250 for ~80% agreement at 95% CI — R19)
- Stratification: taxonomy defines strata; oversample action-trigger class
- Split: ~60/40 dev/test; never tune on test
- Double-label + adjudicate; α ≥ 0.80 on primary gate field (R7, R20)
- Multi-axis schema: gate field + graceful-failure/partial metadata + `failure_mode` + evidence spans
- Provenance: tag `production` vs `synthetic`; report production-only test metrics
- Version: `gold-v1`, `gold-v2` — label changes are production risk (R15)
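One way to read the size guidance is through the normal-approximation confidence interval for a proportion. This is an interpretation, not necessarily the derivation behind R19: at about 80% agreement, 250 items give a 95% interval of roughly ±5 points.

```python
import math


def agreement_ci_half_width(n, p, z=1.96):
    """Half-width of the normal-approximation CI for a proportion p over n items."""
    return z * math.sqrt(p * (1 - p) / n)


# Half-widths at ~80% agreement for the low, middle, and high end of the range.
for size in (200, 250, 300):
    print(size, round(agreement_ci_half_width(size, 0.80), 3))
```

The interval narrows slowly with size, so going from 200 to 300 items buys only about one point of precision. Note that this applies to the whole set; per-stratum estimates on smaller slices will be much wider.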
Three-tier rollout
- Pilot (~50) — instrument validation
- Confirmation — κ + behavioral shadow
- Full gold set — calibration on frozen test
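The split and provenance rules in the contract can be sketched as a seeded per-stratum split that forces synthetic items into dev, plus a hash that freezes the test split. Field names (`item_id`, `stratum`, `provenance`, `gate_label`) and both helpers are hypothetical.

```python
import hashlib
import json
import random
from collections import defaultdict


def split_gold(items, dev_fraction=0.6, seed=0):
    """Split production items per stratum; synthetic items always go to dev."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    dev, test = [], []
    for item in items:
        if item["provenance"] == "synthetic":
            dev.append(item)  # contamination firewall
        else:
            by_stratum[item["stratum"]].append(item)
    for stratum in sorted(by_stratum):
        # Sort before shuffling so the split depends only on the seed.
        group = sorted(by_stratum[stratum], key=lambda it: it["item_id"])
        rng.shuffle(group)
        cut = round(len(group) * dev_fraction)
        dev.extend(group[:cut])
        test.extend(group[cut:])
    return dev, test


def freeze_hash(split):
    """Order-independent SHA-256 over item IDs and gate labels."""
    canonical = sorted((it["item_id"], it["gate_label"]) for it in split)
    return hashlib.sha256(json.dumps(canonical).encode("utf-8")).hexdigest()
```

In this sketch the 60/40 ratio applies to production items only, so the overall dev share grows with the amount of synthetic data. Recording the hash with each gold version makes any later relabeling of the test split detectable.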
Stage 6: LLM judge calibration
Goal: Validate judge against gold set; earn enable-policy clearance.
Default path: prompt + rubric (not fine-tuning)
| Approach | When |
|---|---|
| Prompt + few-shot calibration | Starting out; human corrections → exemplars (R24) |
| Fine-tuning / distillation | Only after prompt path plateaus; larger labeled set |
Target 75–90% judge–human alignment on pilot before scaling (R15).
Metrics to report
- Precision / recall / F1 on action-triggering negative class
- κ or α vs human labels (prerequisite before trusting P/R)
- ECE as a diagnostic only, not a gate (R18)
- CoT-gaming red-team flip rate (R10)
- Per-`failure_mode` breakdown
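A sketch of the class-specific metrics, assuming human and judge labels are aligned lists and that the action-triggering class is labeled `"fail"`. The function name and label strings are hypothetical.

```python
def trigger_class_metrics(human, judge, trigger="fail"):
    """Precision, recall, and F1 of the judge on the action-triggering class."""
    if len(human) != len(judge):
        raise ValueError("label lists must be the same length")
    pairs = list(zip(human, judge))
    tp = sum(h == trigger and j == trigger for h, j in pairs)
    fp = sum(h != trigger and j == trigger for h, j in pairs)
    fn = sum(h == trigger and j != trigger for h, j in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Running the same function on each `failure_mode` subset gives the per-mode breakdown. Global accuracy is deliberately absent, since a judge can score well on it while missing most of the trigger class.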
Enable-policy
Use reference.md template. Example precision-first profile:
| Gate | Threshold |
|---|---|
| Precision on trigger class | ≥ 0.90 |
| False-action rate on clean successes | ≤ 2% |
| Recall on trigger class | ≥ 0.70 |
| Red-team flip rate | ≤ 5% |
| Judge–human κ | ≥ 0.6 |
| Posture | shadow/off until all met |
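The example profile above can be encoded as data and checked in one place. The thresholds are copied from the table; the metric key names and the `evaluate_policy` helper are hypothetical, and a missing metric is treated as a failed gate.

```python
EXAMPLE_POLICY = {
    "precision": (">=", 0.90),
    "false_action_rate": ("<=", 0.02),
    "recall": (">=", 0.70),
    "flip_rate": ("<=", 0.05),
    "kappa": (">=", 0.6),
}


def evaluate_policy(metrics, policy=EXAMPLE_POLICY):
    """Return (posture, failed_gates); posture is 'shadow' unless all gates pass."""
    failed = []
    for name, (op, threshold) in policy.items():
        value = metrics.get(name)
        if value is None:
            failed.append(name)  # unmeasured gates never pass
        elif op == ">=" and value < threshold:
            failed.append(name)
        elif op == "<=" and value > threshold:
            failed.append(name)
    return ("enabled" if not failed else "shadow"), failed
```

Listing the failed gates, rather than returning a bare boolean, shows which part of calibration needs more work before the next attempt.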
Rollout
Shadow → dev eval enable → production enable. Never iterate prompt on test split.
Stage 7: Continuous monitoring
Goal: Offline regression + online drift detection after gates clear.
flowchart LR
subgraph offline [Offline]
CI["CI/PR: golden dataset"]
SCHED["Scheduled re-run"]
end
subgraph online [Production]
L1["L1: sync checks 100%"]
L2["L2: async judge 5-10%"]
L3["L3: CUSUM drift"]
end
PROD["Traces"] --> L1
PROD --> L2
PROD --> L3
L2 --> GOLD["Add to gold set"]
GOLD --> CI
Operational loops
- Every production failure → candidate golden entry after human review
- Version datasets; treat label changes as production risk
- Quarterly gold refresh; alert if κ drops below floor
- Per-category fail rates — not single global threshold (R15)
- Criteria drift in prod → re-open-code (Stage 1)
Details: reference.md monitoring stack.
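A sketch of two pieces of the online layer: deterministic sampling for the async judge and a one-sided CUSUM alarm on a fail-rate series. Both helpers are hypothetical, and the target, slack, and threshold values are placeholders to be tuned per failure category.

```python
import hashlib


def in_judge_sample(trace_id, rate=0.05):
    """Deterministically select about `rate` of traces for async judging."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def cusum_alarm(values, target, slack, threshold):
    """Return the index where upward drift is first flagged, or None.

    values: per-window fail rates for one failure category.
    The statistic accumulates deviations above target + slack and resets at 0.
    """
    s = 0.0
    for i, x in enumerate(values):
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            return i
    return None
```

Hashing the trace ID keeps the sample stable across reruns, so a trace is either always judged or never judged. Running one CUSUM per category follows the per-category fail-rate guidance above.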
Master workflow checklist
Copy and track across the engagement:
Stage 0 — Traces
- [ ] Full trajectories exported with stable IDs
- [ ] Environment posture verified
- [ ] Trace viewer available to annotators
Stage 1 — Open coding
- [ ] ≥100 traces coded with open notes
- [ ] First-failure discipline applied
- [ ] Saturation log shows ~20 traces with no new codes
Stage 2 — Axial coding
- [ ] Agent-behavior taxonomy (5–6 testable categories)
- [ ] Confounds and judge errors split out
- [ ] Frequency table on corrected rows
- [ ] IAA on categories ≥ threshold
- [ ] Top mode selected for rubric
Stage 3 — Synthetic
- [ ] Dimension spec for scarce strata
- [ ] Coverage map complete
- [ ] Contamination firewall enforced (dev only)
Stage 4 — Rubric
- [ ] Analytic, evidence-grounded criteria in prompt
- [ ] Anchor examples included
- [ ] PROVISIONAL rubric in shadow mode
Stage 5 — Gold set
- [ ] ~250 stratified, double-labeled items
- [ ] α ≥ 0.80 on gate field
- [ ] Frozen test split hashed
Stage 6 — Calibration
- [ ] Class-specific P/R/F1 on trigger class
- [ ] Red-team flip rate measured
- [ ] All enable-policy gates pass on test split
Stage 7 — Monitoring
- [ ] CI golden regression wired
- [ ] Production sampling + drift alerts live
- [ ] Trace-to-dataset loop operational
Anti-patterns
| ID | Anti-pattern | Fix |
|---|---|---|
| AP-1 | Skip open coding; jump to judge prompt | Error analysis first (R1–R3) |
| AP-2 | Count environment blocks as agent failures | Three-axis split |
| AP-3 | Global accuracy as gate metric | Class-specific P/R on trigger class |
| AP-4 | Tune rubric on test split | Dev only; freeze test |
| AP-5 | Synthetic data in held-out test | Dev split only |
| AP-6 | Trust judge confidence / ECE for gating | κ/α + class P/R only |
| AP-7 | Always-pass judge in production | Shadow until gates clear |
| AP-8 | Holistic Likert rubric | Analytic binary criteria |
| AP-9 | Re-roll synthetic until code appears | Record mismatches; fix dimensions |
| AP-10 | Delegate first-pass open coding to LLM | Human first pass only |
Testing pyramid (agentic systems)
When implementing in a layered codebase:
- L1: Pure schema validation of verdict + label types
- L2: Record/replay of eval capture; dataset CRUD; mock providers only
- L3: Mocked judge + offline red-team fixtures; rubric evals marked slow
- L4: Gate failure-mode matrix; rejection tests before acceptance
Never run live LLM calls in default CI. Live flip-rate diagnostics are opt-in only.
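To illustrate the lower layers, here is a sketch of an L1 schema check and an L3-style mocked judge that never touches a live model. The verdict fields follow the gold-set schema described earlier; `validate_verdict`, `MockJudge`, and the exact field names are hypothetical.

```python
ALLOWED_VERDICTS = {"pass", "fail"}


def validate_verdict(payload):
    """L1: pure schema check; returns the names of invalid fields."""
    errors = []
    if payload.get("verdict") not in ALLOWED_VERDICTS:
        errors.append("verdict")
    if not isinstance(payload.get("failure_mode"), (str, type(None))):
        errors.append("failure_mode")
    spans = payload.get("evidence_spans")
    if not isinstance(spans, list) or not all(isinstance(s, str) for s in spans):
        errors.append("evidence_spans")
    return errors


class MockJudge:
    """L3: returns canned verdicts keyed by trace_id; makes no network calls."""

    def __init__(self, canned):
        self.canned = canned

    def judge(self, trace_id):
        return self.canned[trace_id]


# Rejection case first, then acceptance, mirroring the L4 ordering.
judge = MockJudge({
    "t-bad": {"verdict": "maybe", "failure_mode": None, "evidence_spans": []},
    "t-ok": {"verdict": "fail", "failure_mode": "premature_stop",
             "evidence_spans": ["tool: exit before write"]},
})
assert validate_verdict(judge.judge("t-bad")) == ["verdict"]
assert validate_verdict(judge.judge("t-ok")) == []
```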
Additional resources
- Metrics, IAA, enable-policy, bibliography: reference.md
- GoalJudge worked example (this repo): examples-goaljudge.md