name: evals-skills description: >- Orchestrate LLM eval pipeline tasks: audit existing evals, analyze errors in traces, generate synthetic test data, write LLM judge prompts, validate evaluators against human labels, evaluate RAG pipelines, and build annotation interfaces. Based on hamelsmu/evals-skills (50+ company patterns). Use when the user asks for "eval audit", "error analysis", "judge prompt", "validate evaluator", "synthetic data", "evaluate RAG", "annotation interface", "review traces", "evals", or "LLM evaluation". Do NOT use for general code review (use backend-expert or frontend-expert), ML model training, unit testing (use qa-test-expert), or non-LLM evaluation tasks. Korean triggers: "LLM 평가", "eval 파이프라인".
Evals Skills — LLM Evaluation Pipeline Toolkit
Orchestrate LLM product evaluation tasks using 7 specialized sub-skills from hamelsmu/evals-skills.
Sub-Skill Index
| Sub-Skill | When to Use | Reference |
|---|---|---|
| eval-audit | Starting point: audit an eval pipeline or bootstrap from scratch | references/eval-audit.md |
| error-analysis | Read traces systematically and categorize failure modes | references/error-analysis.md |
| generate-synthetic-data | Bootstrap eval datasets when real traces are sparse | references/generate-synthetic-data.md |
| write-judge-prompt | Design binary pass/fail LLM-as-Judge for subjective criteria | references/write-judge-prompt.md |
| validate-evaluator | Calibrate LLM judges against human labels (TPR/TNR) | references/validate-evaluator.md |
| evaluate-rag | Evaluate retrieval and generation quality in RAG pipelines | references/evaluate-rag.md |
| build-review-interface | Build browser-based annotation interfaces for trace review | references/build-review-interface.md |
For writing guidelines when creating custom eval skills, see references/meta-skill.md. For learning resources and course links, see references/questions.md.
Subagent Model Routing
| Task Type | Model | Rationale |
|---|---|---|
| Exploration / search / file reading | haiku |
낮은 비용, read-only 작업 |
| Analysis / implementation / generation | sonnet |
균형 잡힌 품질-비용 비율 |
| Architecture / complex reasoning | opus |
복잡한 추론이 필요한 경우만 |
Agent tool 호출 시 반드시 model 파라미터를 지정한다.
Workflow
Step 1: Identify the Right Sub-Skill
Ask the user what they need or infer from context:
| User Intent | Sub-Skill |
|---|---|
| "Are my evals any good?" / No eval setup exists | eval-audit |
| "What's failing?" / Need to categorize failures | error-analysis |
| "I don't have enough test data" | generate-synthetic-data |
| "I need an LLM judge for X" | write-judge-prompt |
| "Is my judge accurate?" / Need TPR/TNR | validate-evaluator |
| "My RAG pipeline has issues" | evaluate-rag |
| "I need a UI to review traces" | build-review-interface |
Step 2: Read the Reference and Execute
Read the selected reference file and follow its instructions. Each reference contains the complete procedure: overview, prerequisites, core steps, and anti-patterns.
Step 3: Chain Sub-Skills as Needed
The recommended progression for a new eval pipeline:
error-analysis (or generate-synthetic-data if no traces)
-> write-judge-prompt (for subjective failure modes)
-> validate-evaluator (calibrate against human labels)
For RAG-specific pipelines, use evaluate-rag which covers both retrieval metrics and generation evaluation.
Use eval-audit at any point to check overall pipeline health.
Examples
Example 1: Audit an existing eval pipeline
User says: "We have some evals but I'm not sure they're catching real issues"
Actions:
- Read references/eval-audit.md
- Gather eval artifacts (traces, judge prompts, labeled data)
- Run 6 diagnostic checks (error analysis, evaluator design, judge validation, human review, labeled data, pipeline hygiene)
- Produce prioritized findings report with fixes
Result: Prioritized list of eval pipeline problems with concrete next steps.
Example 2: Build a judge for tone mismatch
User says: "Our support bot sometimes uses the wrong tone. I need an evaluator for that."
Actions:
- Read references/write-judge-prompt.md
- Define binary pass/fail criteria for tone matching
- Write judge prompt with task description, definitions, few-shot examples, structured output
- Recommend validation with references/validate-evaluator.md
Result: Binary pass/fail LLM judge prompt targeting tone mismatch, ready for validation.
Example 3: Bootstrap evals from scratch
User says: "We have no evals at all. Where do I start?"
Actions:
- Read references/eval-audit.md -- "No Eval Infrastructure" section
- If no production traces: read references/generate-synthetic-data.md to create test inputs
- Read references/error-analysis.md to categorize failures from traces
- For each failure mode needing judgment: use write-judge-prompt, then validate-evaluator
Result: End-to-end eval pipeline built from scratch following the recommended progression.
Error Handling
| Error | Action |
|---|---|
| No traces or eval artifacts available | Start with generate-synthetic-data to create test inputs |
| User wants a Likert scale (1-5) evaluator | Recommend binary pass/fail instead; explain via write-judge-prompt anti-patterns |
| Eval pipeline uses ROUGE/BERTScore as primary metric | Flag as a finding; recommend binary evaluators grounded in failure modes |
| No domain expert available for labeling | Minimum viable: one trusted person labels 50-100 traces |
| User wants to skip error analysis | Strongly recommend completing it first -- evaluators built without it measure generic qualities instead of actual failure modes |