---
name: eval-coach
description: Guide users through building comprehensive AI evaluation strategies using Evaluation-Driven Development (EDD)
metadata:
  version: 1.0.0
  author: AI Product Engineer School
triggers:
  - eval
  - evaluation
  - testing strategy
  - test cases
  - LLM testing
---
Eval Coach
An Agent Skill for designing comprehensive AI evaluation strategies using Evaluation-Driven Development (EDD).
Overview
Eval Coach guides you through a structured 5-step framework for evaluating LLM applications:
1. Define Success - Map business goals to measurable metrics
2. Design Dataset - Create diverse test cases (happy path, edge cases, adversarial)
3. Select Methods - Choose Automated, LLM-as-Judge, or Human evaluation
4. Plan Automation - Integrate evals into CI/CD
5. Monitor Production - Track drift and collect feedback
When to Use This Skill
Invoke this skill when:
- Starting a new AI project that needs an evaluation strategy
- Improving an existing agent's reliability
- Comparing different implementation approaches (e.g., LangGraph vs Deep Agents)
- Setting up CI/CD for AI products
- Debugging production quality issues
Evaluation Philosophy
The 50-40-10 Rule
- 50% Automated - Schema validation, keyword checks, latency ($0.00/run)
- 40% LLM-as-Judge - Semantic quality, relevance ($0.01-0.05/run)
- 10% Human - Subjective quality, edge cases ($5-50/run)
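To make the cost implications of this split concrete, here is a minimal sketch that budgets a batch of evaluation runs. It assumes the percentages are read as a share of evaluation runs (the rule could also be read as a share of effort), and `estimate_cost` is a hypothetical helper, not part of this skill's templates.

```python
# Per-run cost ranges in USD, copied from the 50-40-10 rule above.
COST_PER_RUN = {
    "automated": (0.00, 0.00),
    "llm_judge": (0.01, 0.05),
    "human": (5.00, 50.00),
}
# Assumption: the percentages are a share of evaluation runs.
SHARE_PCT = {"automated": 50, "llm_judge": 40, "human": 10}


def estimate_cost(total_runs: int) -> dict:
    """Return {method: (runs, min_cost, max_cost)} for a batch of eval runs."""
    plan = {}
    for method, pct in SHARE_PCT.items():
        runs = total_runs * pct // 100  # integer math avoids float rounding
        low, high = COST_PER_RUN[method]
        plan[method] = (runs, runs * low, runs * high)
    return plan


plan = estimate_cost(100)
for method, (runs, low, high) in plan.items():
    print(f"{method:<10} runs={runs:>3}  cost=${low:.2f} to ${high:.2f}")
```

Under these assumptions, the 10 human reviews in a 100-run batch cost $50 to $500, while the 40 LLM-as-Judge runs cost $0.40 to $2.00, which is why human review is reserved for the cases automation cannot judge.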
Start Small, Iterate
- Begin with 20 high-quality test cases, not 1000 noisy ones
- Distribution: 50% happy path, 35% edge cases, 15% adversarial
- Add cases as you discover production failures
Guidance Framework
Step 1: Define Success
Ask the user:
- What is your product's primary goal?
- What does success look like for a user?
- What are the failure modes that would hurt users or business?
- How would you manually judge if an output is "good"?
From answers, help define:
- Primary metrics (e.g., accuracy, relevance, helpfulness)
- Secondary metrics (e.g., latency, cost, safety)
- Threshold targets (e.g., 95% accuracy, <5s latency)
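Once metrics and thresholds are agreed, they can be written down as data so that later steps check against them mechanically. The following is a rough sketch only: `MetricTarget`, `check_targets`, and the metric names are hypothetical, and the two thresholds are the examples from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricTarget:
    name: str
    threshold: float
    higher_is_better: bool = True

    def passes(self, value: float) -> bool:
        # "95% accuracy" is read as value >= 0.95; "<5s latency" as value < 5.0.
        if self.higher_is_better:
            return value >= self.threshold
        return value < self.threshold


# Example targets taken from the threshold examples above.
TARGETS = [
    MetricTarget("accuracy", 0.95),
    MetricTarget("latency_s", 5.0, higher_is_better=False),
]


def check_targets(measured: dict) -> dict:
    """Map each target name to True/False for the measured values."""
    return {t.name: t.passes(measured[t.name]) for t in TARGETS}


# Illustrative measurements, not real results.
results = check_targets({"accuracy": 0.97, "latency_s": 6.2})
print(results)
```

Keeping the direction of each metric (`higher_is_better`) next to its threshold avoids the common mistake of comparing latency or cost the wrong way round.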
Step 2: Design Dataset
Guide creation of test cases:
# Test Case Template
name: descriptive_name
category: happy_path | edge_case | adversarial
inputs:
  # The inputs your agent receives
  query: "user query here"
  context: "any context"
outputs:
  # What to validate
  expected_fields: [field1, field2]
  should_mention: [keyword1, keyword2]
  should_not_contain: [forbidden_term]
  min_length: 100
  max_length: 5000
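The `outputs` keys in the template above map directly onto automated checks. Here is a minimal sketch of such a checker, assuming the agent returns a dict and that its main response text lives under a hypothetical `answer` key; `validate_output` is an illustrative name, not the implementation shipped in `evaluators.py`.

```python
def validate_output(output: dict, spec: dict, text_key: str = "answer") -> list:
    """Check one agent output against a test case's `outputs` spec.

    Returns a list of failure messages; an empty list means the case passed.
    `text_key` is a hypothetical field name holding the main response text.
    """
    failures = []
    for field in spec.get("expected_fields", []):
        if field not in output:
            failures.append(f"missing field: {field}")

    text = str(output.get(text_key, ""))
    lowered = text.lower()
    for keyword in spec.get("should_mention", []):
        if keyword.lower() not in lowered:
            failures.append(f"missing keyword: {keyword}")
    for term in spec.get("should_not_contain", []):
        if term.lower() in lowered:
            failures.append(f"forbidden term present: {term}")

    if len(text) < spec.get("min_length", 0):
        failures.append(f"too short: {len(text)} chars")
    if len(text) > spec.get("max_length", float("inf")):
        failures.append(f"too long: {len(text)} chars")
    return failures


# Illustrative spec and output; the field names and keywords are made up.
spec = {
    "expected_fields": ["answer", "sources"],
    "should_mention": ["refund"],
    "should_not_contain": ["guarantee"],
    "min_length": 20,
    "max_length": 5000,
}
output = {"answer": "You can request a refund within 30 days of purchase."}
failures = validate_output(output, spec)
print(failures)
```

Returning a list of messages rather than a single boolean makes a failing case easier to debug, since every violated rule is reported at once.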
Categories:
- Happy Path (50%): Common, expected inputs
- Edge Cases (35%): Unusual but valid inputs, boundary conditions
- Adversarial (15%): Invalid inputs, prompt injection, error conditions
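The category percentages can be turned into concrete case counts for any dataset size. A small sketch, assuming leftover cases from rounding go to the categories with the largest remainders; `split_cases` is a hypothetical helper.

```python
# Category mix from the list above, as whole percentages.
MIX_PCT = {"happy_path": 50, "edge_case": 35, "adversarial": 15}


def split_cases(total: int) -> dict:
    """Split `total` test cases across categories so the counts sum to `total`."""
    counts = {}
    remainders = {}
    for category, pct in MIX_PCT.items():
        counts[category], remainders[category] = divmod(total * pct, 100)

    # Hand out any cases lost to rounding, largest remainder first.
    leftover = total - sum(counts.values())
    by_remainder = sorted(remainders, key=remainders.get, reverse=True)
    for category in by_remainder[:leftover]:
        counts[category] += 1
    return counts


print(split_cases(20))
```

For the recommended starting size of 20 cases this gives 10 happy path, 7 edge cases, and 3 adversarial cases.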
Step 3: Select Methods
Match methods to evaluation needs:
| What to Measure | Method | Cost | When |
|---|---|---|---|
| Schema/format | Automated | Free | Always (CI) |
| Keywords present | Automated | Free | Always (CI) |
| Semantic quality | LLM-as-Judge | $0.01-0.05 | Pre-deploy |
| Relevance to input | LLM-as-Judge | $0.01-0.05 | Pre-deploy |
| Subjective quality | Human | $5-50 | Edge cases |
| Safety/compliance | Human + Automated | Varies | Always |
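For the LLM-as-Judge rows, one possible shape is a prompt that asks for a fixed-format score plus a parser that tolerates malformed replies. This is a sketch under stated assumptions: `call_llm` stands in for whatever model client you use, the prompt wording and the 1 to 5 scale are illustrative choices, and `fake_llm` is a stub so the example runs without network access.

```python
import re
from typing import Callable, Optional

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {query}
Answer: {answer}
Rate the answer's relevance to the question from 1 (irrelevant) to 5 (fully relevant).
Reply with the line "SCORE: <n>" followed by a one-sentence reason."""


def judge_relevance(query: str, answer: str,
                    call_llm: Callable[[str], str]) -> Optional[int]:
    """Ask a judge model for a 1-5 relevance score; None if the reply is unparseable."""
    reply = call_llm(JUDGE_PROMPT.format(query=query, answer=answer))
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def fake_llm(prompt: str) -> str:
    # Stub standing in for a real model call (assumption for this example).
    return "SCORE: 4\nThe answer addresses the question but omits one detail."


score = judge_relevance(
    "What is EDD?",
    "Evaluation-Driven Development means writing evals before shipping.",
    fake_llm,
)
print(score)
```

Returning `None` for an unparseable reply, instead of a default score, keeps judge failures visible so they are not silently counted as passes or failures.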
Step 4: Plan Automation
Integration tiers:
Tier 1: PR-Level (<5 min)
- Automated tests only
- Run on every PR
- Block merge on failure
Tier 2: Pre-Deploy (15-30 min)
- Full test suite including LLM-as-Judge
- Run before production deployment
- Compare to baseline metrics
Tier 3: Production Monitoring (Continuous)
- Sample real traffic for evaluation
- Track drift over time
- Alert on metric degradation
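The "compare to baseline metrics" step in Tier 2 can be sketched as a small gate function. The metric names, the 0.02 tolerance, and the numbers below are illustrative assumptions, and all metrics are assumed to be higher-is-better.

```python
def find_regressions(baseline: dict, current: dict, tolerance: float = 0.02) -> dict:
    """Return metrics that fell more than `tolerance` below their baseline.

    Assumes higher is better for every metric. A metric missing from
    `current` is reported as a regression with value None.
    """
    regressions = {}
    for name, base in baseline.items():
        value = current.get(name)
        if value is None or base - value > tolerance:
            regressions[name] = (base, value)
    return regressions


# Illustrative numbers, not real results.
baseline = {"relevance": 0.90, "schema_pass_rate": 1.0}
current = {"relevance": 0.86, "schema_pass_rate": 1.0}

regressions = find_regressions(baseline, current)
gate_passed = not regressions
print(regressions, gate_passed)
```

In a deployment pipeline, a script built on this would typically exit with a non-zero status when `gate_passed` is false so that the deploy step is blocked.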
Step 5: Monitor Production
Track these signals:
- Data Drift - Input distribution changing
- Concept Drift - User expectations changing
- Model Drift - Provider silently updating the model
- Task Drift - Users asking for new capabilities
Recommendation: Pin model versions and run weekly evals on production samples.
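A weekly eval on production samples produces a pass rate that can be compared against an earlier baseline period. The sketch below shows one simple way to alert on degradation; the four-week baseline window, the 0.05 maximum drop, and the pass rates are all illustrative assumptions.

```python
from statistics import mean


def drift_alert(weekly_pass_rates: list, baseline_weeks: int = 4,
                max_drop: float = 0.05) -> bool:
    """True when the latest week is more than `max_drop` below the baseline average.

    The baseline is the mean of the first `baseline_weeks` entries.
    """
    if len(weekly_pass_rates) <= baseline_weeks:
        return False  # not enough history to compare against a baseline
    baseline = mean(weekly_pass_rates[:baseline_weeks])
    return baseline - weekly_pass_rates[-1] > max_drop


# Made-up pass rates, oldest first.
history = [0.94, 0.95, 0.93, 0.94, 0.92, 0.86]
alert = drift_alert(history)
print(alert)
```

A fixed baseline window is the simplest choice; it will not distinguish between the four drift types listed above, so an alert is a prompt to investigate rather than a diagnosis.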
Output Format
After completing the framework, provide:
## Evaluation Plan for [Product Name]
### Business Objectives
- Primary goal: [goal]
- Success criteria: [criteria]
### Dataset Strategy
- Total test cases: [N]
- Happy path: [N1] cases
- Edge cases: [N2] cases
- Adversarial: [N3] cases
### Evaluation Methods
| Metric | Type | Method | Threshold |
|--------|------|--------|-----------|
| ... | ... | ... | ... |
### CI/CD Integration
- PR checks: [list]
- Pre-deploy: [list]
- Monitoring: [list]
### Next Steps
1. [First action]
2. [Second action]
3. [Third action]
Templates
This skill includes starter templates in the templates/ directory:
- dataset.py - LangSmith dataset creation
- evaluators.py - Common evaluator implementations
- compare.py - Experiment comparison utilities
Examples
See examples/ for complete evaluation plans:
- research_squad_eval.md - Multi-agent research system evaluation