name: evaluation-testing
description: >
  Use this skill to design and execute evaluation frameworks for LLM agents, implement trajectory testing, deploy LLM-as-judge patterns, build automated eval pipelines, and integrate agent testing into CI/CD workflows.
  This skill enforces: structured behavioral assertions, trajectory-vs-outcome evaluation matrices, verifier agent topologies, regression detection baselines, hallucination scoring engines, and benchmark dataset lifecycle management.
  Do NOT use for: unit testing traditional software, load/performance testing infrastructure, or model fine-tuning data preparation.
version: "2.0.0"
author: "j4flmao"
license: "MIT"
type: skill
compatibility:
  claude-code: true
  cursor: true
  codex: true
  windsurf: true
tags: [harness-engineering, evaluation-testing, agent-frameworks, llm-judge, ci-cd, benchmarks]
Evaluation Testing Skill
Purpose
Provides a production-grade evaluation and testing framework for LLM agent systems. Enables teams to measure agent correctness across behavioral dimensions, detect regressions in multi-step reasoning chains, score hallucination severity, and embed automated evaluation gates into deployment pipelines. This system handles the fundamental non-determinism of LLM outputs by combining trajectory-level analysis, outcome-level assertions, and LLM-as-judge consensus protocols into a unified testing harness.
Core Principles
- Trajectory Over Outcome: Evaluate the reasoning path, not just the final answer. An agent that reaches the correct output through flawed reasoning is a latent failure.
- Statistical Significance Over Single Runs: Agent evaluations must use repeated sampling ($N \ge 5$) and report confidence intervals, never single-shot pass/fail assertions.
- Human-Aligned Judging: LLM-as-judge evaluators must be calibrated against human preference baselines using Cohen's Kappa ($\kappa \ge 0.60$) before deployment.
- Regression Baselines Are Sacred: Every eval suite must maintain versioned baseline snapshots. Regressions are detected against these baselines, not arbitrary thresholds.
- Eval Datasets Are Living Assets: Test datasets must be versioned, deduplicated, stratified by difficulty, and refreshed on a scheduled cadence to prevent benchmark overfitting.
Agent Protocol
Triggers
Use this skill when processing:
- Agent output quality assessments requiring structured scoring rubrics.
- Multi-step trajectory evaluations for tool-calling or chain-of-thought agents.
- CI/CD pipeline gates that must block deployments on eval regressions.
- Hallucination detection across factual claims, code generation, or document summarization.
- Benchmark design for comparing agent architectures or prompt strategies.
- Dataset curation for evaluation test suites.
Input Context Required
- Agent Outputs: Raw completions, tool call traces, or conversation transcripts to evaluate.
- Evaluation Rubric: A structured scoring guide defining dimensions (correctness, helpfulness, safety, coherence).
- Baseline Metrics: Historical eval scores from the previous accepted version.
- Ground Truth Dataset: Labeled examples with expected outputs or acceptable output ranges.
- Target Significance Level ($\alpha$): The statistical significance threshold (typically 0.05, corresponding to 95% confidence intervals).
Output Artifact
- Evaluation Report: JSON document containing per-dimension scores, aggregate metrics, and statistical tests.
- Regression Verdict: Binary pass/fail with confidence intervals and effect sizes.
- Hallucination Audit Log: Itemized list of factual claims with verification status and source attributions.
Response Formats
For programmatic integration, evaluation results must be delivered in this format:
{
  "eval_run_id": "eval-2026-06-04-001",
  "model_version": "agent-v2.3.1",
  "dimensions": {
    "correctness": { "mean": 0.87, "ci_lower": 0.82, "ci_upper": 0.92, "n": 200 },
    "helpfulness": { "mean": 0.91, "ci_lower": 0.87, "ci_upper": 0.95, "n": 200 },
    "safety": { "mean": 0.99, "ci_lower": 0.97, "ci_upper": 1.00, "n": 200 }
  },
  "regression_detected": false,
  "hallucination_rate": 0.034,
  "verdict": "PASS",
  "baseline_comparison": {
    "previous_version": "agent-v2.3.0",
    "delta_correctness": 0.02,
    "p_value": 0.12
  }
}
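A consumer of this report may want to sanity-check it before acting on the verdict. The sketch below is one minimal way to do that; `validate_report` and `REQUIRED_KEYS` are illustrative names that do not come from the skill, and the particular checks are assumptions about what a gate would care about.

```python
import json

# Top-level keys taken from the example report format.
REQUIRED_KEYS = {
    "eval_run_id", "model_version", "dimensions",
    "regression_detected", "hallucination_rate", "verdict",
}


def validate_report(raw: str) -> list:
    """Return a list of problems found in a serialized report (empty if none)."""
    try:
        report = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(report, dict):
        return ["report must be a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - report.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for name, dim in report.get("dimensions", {}).items():
        try:
            inside = dim["ci_lower"] <= dim["mean"] <= dim["ci_upper"]
        except KeyError as exc:
            problems.append(f"{name}: missing field {exc}")
            continue
        if not inside:
            problems.append(f"{name}: mean lies outside its confidence interval")
    if report.get("verdict") not in ("PASS", "FAIL"):
        problems.append("verdict must be PASS or FAIL")
    return problems
```

An empty list means the report is structurally usable; anything else is better treated as an evaluation failure than as a pass.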
Decision Matrix for Evaluation Strategy
What kind of agent output are you evaluating?
├── Single-Turn Q&A / Factual Responses
│ ├── Ground truth available?
│ │ ├── Yes → Exact Match / F1 / BLEU + Hallucination Scoring
│ │ └── No → LLM-as-Judge (Pointwise) + Human Calibration
│ │
├── Multi-Step Tool-Calling Chains
│ ├── Trajectory matters?
│ │ ├── Yes → Trajectory Evaluation (step-level scoring)
│ │ └── No → Outcome-Only Evaluation (final state diff)
│ │
├── Code Generation
│ ├── Executable test cases available?
│ │ ├── Yes → Execution-Based Pass@k Scoring
│ │ └── No → LLM-as-Judge (Pairwise Comparison)
│ │
└── Long-Form Content / Summarization
├── Reference summaries available?
│ ├── Yes → ROUGE-L + BERTScore + Faithfulness Check
│ └── No → LLM-as-Judge (Rubric-Based) + Hallucination Audit
Detailed Architectural Overview
The evaluation testing framework operates as a pipeline from agent output collection through scoring, aggregation, and regression analysis.
+----------------+     +-----------------+     +------------------+     +---------------------+
| Agent Runtime  | ──► | Trace Collector | ──► | Eval Dispatcher  | ──► | Scoring Engines     |
| (completions)  |     | (trajectories)  |     | (routes by type) |     | (judge/metric/exec) |
+----------------+     +-----------------+     +------------------+     +---------------------+
                                                                                   │
                                                                                   ▼
+----------------+                                                      +---------------------+
| CI/CD Gateway  | ◄─────────────────────────────────────────────────── | Aggregator &        |
| (pass/fail)    |                                                      | Regression Tester   |
+----------------+                                                      +---------------------+
Evaluation Lifecycle
[Agent Produces Output]
│
├──► (A) Trace Collection ──► Captures tool calls, reasoning steps, final output
│
├──► (B) Eval Routing ──► Matches output type to scoring strategy (judge/metric/exec)
│
├──► (C) Multi-Dimensional Scoring ──► $S_d = \frac{1}{N}\sum_{i=1}^{N} J_d(o_i, r_i)$
│
├──► (D) Statistical Aggregation ──► Computes means, CIs, effect sizes (Cohen's d)
│
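└──► (E) Regression Test ──► t-test against baseline; for two independent samples of equal size $n$ with pooled standard deviation $s_p$: $t = \frac{\bar{X}_1 - \bar{X}_2}{s_p\sqrt{2/n}}$ (when both versions run on the same test cases, use the paired test from Phase 5 instead)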
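The two formulas in steps (C) and (E) can be worked through numerically. The following is a small sketch using only the standard library; the function names and the sample scores are made up for illustration, and the statistic assumes equal sample sizes, as the $\sqrt{2/n}$ term implies.

```python
import math
import statistics


def dimension_score(judgments):
    """S_d: the mean of the per-item judge scores J_d(o_i, r_i)."""
    return sum(judgments) / len(judgments)


def pooled_t_statistic(current, baseline):
    """t = (mean_1 - mean_2) / (s_p * sqrt(2 / n)) for two samples of equal size n."""
    if len(current) != len(baseline):
        raise ValueError("this form of the statistic assumes equal sample sizes")
    n = len(current)
    # With equal n, the pooled variance is the average of the two sample variances.
    s_p = math.sqrt((statistics.variance(current) + statistics.variance(baseline)) / 2)
    diff = statistics.mean(current) - statistics.mean(baseline)
    return diff / (s_p * math.sqrt(2 / n))


# Illustrative scores only, not measured results.
current = [0.90, 0.80, 0.85, 0.95, 0.90]
baseline = [0.80, 0.75, 0.80, 0.85, 0.80]
t = pooled_t_statistic(current, baseline)
```

The resulting `t` is compared against a t-distribution with $2n - 2$ degrees of freedom to obtain the p-value.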
Workflow Steps
Phase 1: Trace Collection & Dataset Preparation
- Instrument Agent Runtime: Attach trace collectors to capture every tool call, reasoning step, and final output in structured format.
- Load Evaluation Dataset: Pull versioned test cases from the dataset registry with stratified sampling by difficulty tier.
- Generate Agent Outputs: Execute the agent against all test cases with temperature fixed and random seed locked for reproducibility.
- Serialize Trajectories: Store complete execution traces (inputs, intermediate states, outputs) in JSONL format.
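The serialization step can be sketched as one JSON object per line. The field names below (`case_id`, `steps`, `final_output`) are assumptions chosen for illustration; the skill does not prescribe a trace schema here.

```python
import json
import os
import tempfile


def write_traces(path, traces):
    """Append traces to a JSONL file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        for trace in traces:
            f.write(json.dumps(trace, sort_keys=True) + "\n")


def read_traces(path):
    """Load every non-empty line of a JSONL file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical trace for a single test case.
example = {
    "case_id": "case-001",
    "steps": [{"tool": "search", "parameters": {"query": "weather"}}],
    "final_output": "It is sunny.",
}

# Round-trip the trace through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    trace_path = os.path.join(tmp, "traces.jsonl")
    write_traces(trace_path, [example])
    loaded = read_traces(trace_path)
```

Appending rather than rewriting keeps earlier traces intact if a run is interrupted partway through the dataset.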
Phase 2: Scoring Engine Selection
- Classify Output Type: Determine whether each test case requires metric-based, execution-based, or judge-based evaluation.
- Load Rubric Definitions: Bind dimension-specific scoring rubrics (correctness, helpfulness, safety, coherence) to the eval dispatcher.
- Configure Judge Models: Initialize LLM-as-judge instances with calibrated system prompts and few-shot examples.
- Set Sampling Parameters: Configure $N$ judge samples per item for consensus scoring.
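Classifying output types amounts to a routing function in front of the scoring engines. A minimal sketch follows, assuming hypothetical test-case fields `unit_tests` and `reference`; a real dispatcher would likely read an explicit type tag from the dataset instead.

```python
from collections import defaultdict


def route(test_case: dict) -> str:
    """Pick a scoring strategy for one test case (field names are assumptions)."""
    if test_case.get("unit_tests"):
        return "execution"  # executable tests available
    if test_case.get("reference") is not None:
        return "metric"     # ground truth available
    return "judge"          # fall back to LLM-as-judge


def dispatch(test_cases):
    """Group test case ids by the strategy that will score them."""
    buckets = defaultdict(list)
    for case in test_cases:
        buckets[route(case)].append(case["id"])
    return dict(buckets)
```

Checking for executable tests first mirrors the decision matrix, where execution-based scoring is preferred whenever test cases exist.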
Phase 3: Multi-Dimensional Evaluation
- Execute Metric Evaluations: Run deterministic metrics (F1, BLEU, ROUGE-L, BERTScore) on applicable test cases.
- Execute Judge Evaluations: Route subjective dimensions through LLM-as-judge with structured output schemas.
- Execute Trajectory Evaluations: Score step-by-step reasoning chains against golden trajectories.
- Run Hallucination Detection: Extract factual claims and verify against source documents.
Phase 4: Statistical Aggregation
- Compute Dimension Means: Calculate per-dimension score averages with bootstrap confidence intervals.
- Compute Effect Sizes: Calculate Cohen's d between current and baseline score distributions.
- Run Normality Tests: Apply Shapiro-Wilk test to determine appropriate statistical comparison method.
- Generate Score Distributions: Plot histograms and box plots for each evaluation dimension.
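The first two aggregation steps can be sketched with the standard library alone. This is an illustrative implementation of a percentile bootstrap and of Cohen's d with a pooled standard deviation; the function names, the default of 1000 resamples, and the fixed seed are assumptions rather than requirements of the skill.

```python
import math
import random
import statistics


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # seeded so repeated runs give the same interval
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lower_idx = int(n_resamples * alpha / 2)
    upper_idx = min(n_resamples - 1, int(n_resamples * (1 - alpha / 2)))
    return means[lower_idx], means[upper_idx]


def cohens_d(current, baseline):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(current), len(baseline)
    pooled_var = (
        (n1 - 1) * statistics.variance(current)
        + (n2 - 1) * statistics.variance(baseline)
    ) / (n1 + n2 - 2)
    return (statistics.mean(current) - statistics.mean(baseline)) / math.sqrt(pooled_var)
```

The interval endpoints map directly onto the `ci_lower` and `ci_upper` fields of the evaluation report.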
Phase 5: Regression Detection
- Load Baseline Snapshots: Retrieve the most recent accepted baseline from the eval registry.
- Execute Statistical Tests: Run paired t-tests or Wilcoxon signed-rank tests comparing current vs. baseline.
- Apply Holm-Bonferroni Correction: Correct for multiple comparisons across evaluation dimensions.
- Render Regression Verdict: Emit PASS/FAIL based on corrected p-values and minimum effect size thresholds.
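The Holm-Bonferroni step is short enough to write out. The sketch below takes one p-value per evaluation dimension and returns which null hypotheses are rejected; the function name and the dictionary shape are illustrative choices.

```python
def holm_bonferroni(p_values: dict, alpha: float = 0.05) -> dict:
    """Holm step-down procedure: map each dimension to True if it is rejected."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    decisions = {}
    still_rejecting = True
    for rank, (name, p) in enumerate(ordered):
        # The k-th smallest p-value (0-based rank) is compared to alpha / (m - rank).
        if still_rejecting and p <= alpha / (m - rank):
            decisions[name] = True
        else:
            # Once one test fails, all larger p-values are also not rejected.
            still_rejecting = False
            decisions[name] = False
    return decisions
```

For example, with p-values of 0.01, 0.03, and 0.04 across three dimensions at $\alpha = 0.05$, only the first is rejected: 0.01 is below 0.05/3, but 0.03 exceeds 0.05/2, which stops the procedure.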
Phase 6: CI/CD Integration & Reporting
- Publish Eval Report: Write structured JSON report to the CI artifact store.
- Update Baseline Registry: If verdict is PASS and metrics improve, promote current scores to the new baseline.
- Gate Deployment: Block or allow deployment based on regression verdict and mandatory dimension thresholds.
- Alert on Degradations: Send notifications for statistically significant regressions exceeding alert thresholds.
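The deployment gate can be reduced to a small pure function over the evaluation report. In this sketch the per-dimension floors are hypothetical, and comparing the lower confidence bound (rather than the mean) to the floor is an assumption, chosen because it is the more conservative option.

```python
def gate(report: dict, min_scores: dict) -> tuple:
    """Return (allowed, reasons) for a deployment decision.

    `min_scores` maps mandatory dimensions to hypothetical minimum values.
    """
    reasons = []
    if report.get("regression_detected"):
        reasons.append("regression detected against baseline")
    for dim, floor in min_scores.items():
        lower = report["dimensions"][dim]["ci_lower"]
        if lower < floor:
            reasons.append(f"{dim}: ci_lower {lower:.2f} is below floor {floor:.2f}")
    return (not reasons, reasons)
```

In a CI job, the boolean would typically become the process exit code, for example `sys.exit(0 if allowed else 1)`, with the reasons printed to the job log.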
Extended Troubleshooting Guide
When implementing evaluation testing frameworks, you may encounter the following common failure modes:
| Symptom | Primary Cause | Mitigation Action |
|---|---|---|
| High variance in LLM-as-judge scores | Judge prompt is underspecified or lacks calibration examples. | Add 3-5 few-shot examples covering edge cases and re-calibrate against human labels. |
| False regression alerts on every run | Baseline was captured from a single run without confidence intervals. | Re-capture baseline using $N \ge 50$ samples and store distribution parameters. |
| Hallucination scorer flags correct outputs | Verification source documents are incomplete or outdated. | Expand source corpus and add a confidence threshold ($\tau \ge 0.8$) before flagging. |
| CI pipeline times out during eval | Full eval suite runs against entire dataset on every commit. | Implement tiered eval: fast smoke tests on PR, full suite on merge to main. |
| Judge scores are inflated for outputs from its own model (self-bias) | Same model used for generation and judging. | Use a different model family for judging or implement cross-model consensus. |
| Eval scores plateau despite agent improvements | Benchmark saturation — test cases are too easy. | Refresh dataset with adversarial examples targeting known failure modes. |
| Trajectory eval misses semantic equivalence | Step comparison uses exact string matching. | Use semantic similarity (cosine $\ge 0.85$) for step-level comparison instead. |
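The mitigation in the last row, comparing steps by cosine similarity, can be sketched as follows. The vectors are assumed to come from some embedding model that is not specified here; only the comparison itself is shown, and `steps_match` is an illustrative name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def steps_match(expected_vec, actual_vec, threshold=0.85):
    """True when two step embeddings are similar enough to count as equivalent."""
    return cosine_similarity(expected_vec, actual_vec) >= threshold
```

The 0.85 default follows the threshold named in the table; it is worth checking against a handful of hand-labeled equivalent and non-equivalent step pairs before relying on it.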
Complete Evaluation Pipeline Scenario
Below is a typical end-to-end evaluation execution for a code-generation agent:
[PR Opened] ──► CI Trigger fires
│
[Stage 1] ──► Load 50-case smoke test dataset ──► Run agent on all cases
│
[Stage 2] ──► Route: 30 exec-based (pass@1) + 20 judge-based (correctness)
│
[Stage 3] ──► exec-based: pass@1 = 0.83 (baseline: 0.81); judge-based: judge_mean = 4.2/5 (baseline: 4.1/5)
│
[Stage 4] ──► Paired t-test: exec-based p = 0.23 (not significant); judge-based p = 0.34 (not significant)
│
[Stage 5] ──► Verdict: PASS ──► Merge allowed ──► Full eval queued on main branch
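The scenario reports pass@1 without saying how it is computed. One widely used estimator for pass@k, given $n$ sampled completions of which $c$ pass the tests, is $1 - \binom{n-c}{k} / \binom{n}{k}$. The sketch below assumes that estimator; the skill itself does not state which one it uses.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: completions sampled for the task, c: completions that passed, k <= n.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so one draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For $k = 1$ this reduces to $c / n$, the plain fraction of passing samples; the suite-level score is the mean of the per-task values.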
Rules and Guidelines
- Rule 1: Never evaluate agent outputs with a single sample. All eval dimensions must use $N \ge 5$ samples with reported confidence intervals.
- Rule 2: LLM-as-judge prompts must include explicit scoring rubrics with level definitions (e.g., 1=incorrect, 3=partially correct, 5=fully correct).
- Rule 3: Trajectory evaluations must score both the correctness of individual steps AND the optimality of the overall path.
- Rule 4: Eval datasets must be versioned using content hashes and must never be modified in-place. Create new versions instead.
- Rule 5: Regression detection must use family-wise error rate correction (Holm-Bonferroni) when testing across multiple dimensions simultaneously.
Reference Guides
Below are links to the reference guides detailing the algorithms, data schemas, code implementations, and integration patterns used in this evaluation testing framework:
- trajectory-evaluation.md: Covers step-by-step trajectory scoring algorithms, golden trajectory comparison, semantic step matching, and trajectory optimality metrics.
- llm-as-judge-patterns.md: Details LLM-as-judge architectures including pointwise scoring, pairwise comparison, reference-guided judging, consensus protocols, and calibration techniques.
- verifier-agent-patterns.md: Defines dedicated verification agent topologies, cross-model verification, execution-based verification, and multi-agent debate protocols.
- cicd-eval-integration.md: Provides CI/CD pipeline configurations for GitHub Actions, GitLab CI, and Jenkins with tiered eval stages, artifact management, and deployment gates.
- regression-detection.md: Outlines statistical regression detection methods, baseline management, effect size calculations, and alerting threshold configurations.
- benchmark-design.md: Explains benchmark dataset design principles, difficulty stratification, contamination prevention, and saturation detection algorithms.
- hallucination-scoring.md: Covers hallucination detection and scoring pipelines, claim extraction, source verification, faithfulness metrics, and severity classification.
- eval-dataset-management.md: Defines dataset versioning schemas, content-hash registries, stratified sampling strategies, and dataset refresh lifecycle management.
Handoff
For projects requiring prompt optimization before evaluation, hand off to context-engineering. For systems implementing architectural constraints on agent behavior, hand off to architectural-constraints. For agent failure recovery during evaluation runs, hand off to error-recovery.
Implementation Patterns
LLM-as-Judge Implementation
from typing import Dict, List, Optional

import numpy as np
from sklearn.metrics import cohen_kappa_score


class LLMAsJudge:
    """Builds judge prompts, checks calibration, and aggregates judge samples."""

    def __init__(self, judge_model: str = "gpt-4", calibration_samples: int = 50):
        self.judge_model = judge_model
        self.calibration_samples = calibration_samples
        self.calibration_data: List[Dict] = []

    def build_judge_prompt(self, rubric: Dict, candidate: str,
                           reference: Optional[str] = None) -> str:
        prompt = "You are an expert evaluator. Score the following output.\n\n"
        prompt += "## Scoring Rubric\n"
        for dim, criteria in rubric.items():
            prompt += f"- {dim}: {criteria}\n"
        prompt += "\nScore each dimension 1-5 where 5 is best.\n\n"
        prompt += f"## Candidate Output\n{candidate}\n\n"
        if reference:
            prompt += f"## Reference Answer\n{reference}\n\n"
        prompt += "## Output Format\n"
        # Ask for the rubric's own dimension names so the returned keys
        # line up with what consensus_score expects.
        fields = ", ".join(f'"{dim}": score' for dim in rubric)
        prompt += "Return JSON: {" + fields + "}"
        return prompt

    def calibrate(self, human_scores: List[Dict], judge_scores: List[Dict]) -> float:
        """Cohen's kappa between human and judge scores, paired by item and dimension."""
        human_flat, judge_flat = [], []
        for human, judge in zip(human_scores, judge_scores):
            for dim in human:  # pair by key instead of relying on dict ordering
                human_flat.append(human[dim])
                judge_flat.append(judge[dim])
        kappa = cohen_kappa_score(
            np.round(human_flat).astype(int),
            np.round(judge_flat).astype(int),
        )
        return float(kappa)

    def consensus_score(self, multiple_judgments: List[Dict]) -> Dict:
        """Aggregate N judge samples for one item into per-dimension statistics."""
        dims = list(multiple_judgments[0].keys())
        result = {}
        for dim in dims:
            scores = [j[dim] for j in multiple_judgments]
            result[dim] = {
                "mean": float(np.mean(scores)),
                "std": float(np.std(scores)),  # population std (ddof=0)
                "median": float(np.median(scores)),
                "min": float(min(scores)),
                "max": float(max(scores)),
                "n": len(scores),
            }
        return result
Trajectory Evaluator
from typing import Dict, List


class TrajectoryEvaluator:
    """Scores an actual tool-call trajectory against a golden trajectory."""

    def __init__(self, semantic_threshold: float = 0.85):
        self.semantic_threshold = semantic_threshold

    def evaluate_step(self, expected: Dict, actual: Dict) -> Dict:
        tool_match = expected.get("tool") == actual.get("tool")
        param_sim = self._parameter_similarity(
            expected.get("parameters", {}),
            actual.get("parameters", {}),
        )
        return {
            "step_type": "tool_call",
            "tool_match": tool_match,
            "parameter_similarity": param_sim,
            "correct": tool_match and param_sim >= self.semantic_threshold,
        }

    def _parameter_similarity(self, expected: Dict, actual: Dict) -> float:
        # Fraction of parameter keys whose values are exactly equal.
        # This is a strict stand-in for semantic comparison of values.
        all_keys = set(expected.keys()) | set(actual.keys())
        if not all_keys:
            return 1.0
        matches = sum(1 for k in all_keys if expected.get(k) == actual.get(k))
        return matches / len(all_keys)

    def evaluate_trajectory(self, expected_steps: List[Dict],
                            actual_steps: List[Dict]) -> Dict:
        # zip() stops at the shorter list, so steps are compared position by position.
        step_scores = [
            self.evaluate_step(exp, act)
            for exp, act in zip(expected_steps, actual_steps)
        ]
        correct_steps = sum(1 for s in step_scores if s["correct"])
        return {
            "step_count": len(step_scores),
            "correct_steps": correct_steps,
            # Divide by the expected length so that missing steps lower accuracy
            # instead of being silently dropped by zip().
            "step_accuracy": correct_steps / max(len(expected_steps), 1),
            "path_optimality": self._compute_optimality(expected_steps, actual_steps),
            "step_details": step_scores,
        }

    def _compute_optimality(self, expected: List, actual: List) -> float:
        # 1.0 when the agent used no more steps than the golden path.
        return min(1.0, len(expected) / max(len(actual), 1))
Hallucination Detection Pipeline
import re
from typing import Dict, List

# Verbs that often signal a factual assertion; matched as whole words only,
# so that "is" does not match inside words such as "this" or "list".
_CLAIM_PATTERN = re.compile(
    r"\b(is|are|was|were|has|have|contains|located|found|known)\b",
    re.IGNORECASE,
)


class HallucinationDetector:
    """Heuristic claim extraction with lexical-overlap verification."""

    def __init__(self, verifier_model: str = "gpt-4"):
        # Kept for a model-based verifier; the lexical check below does not call it.
        self.verifier_model = verifier_model

    def extract_claims(self, text: str) -> List[str]:
        sentences = re.split(r'(?<=[.!?])\s+', text)
        claims = [s.strip() for s in sentences if _CLAIM_PATTERN.search(s)]
        return claims[:20]  # cap the number of claims verified per output

    def verify_claims(self, claims: List[str], source_docs: List[str]) -> List[Dict]:
        results = []
        for claim in claims:
            verification = self._verify_single(claim, source_docs)
            results.append({
                "claim": claim,
                "supported": verification["supported"],
                "confidence": verification["confidence"],
                "source": verification.get("source"),
            })
        return results

    def _verify_single(self, claim: str, sources: List[str]) -> Dict:
        # Confidence is the share of the claim's words found in the best-matching source.
        claim_words = set(claim.lower().split())
        best_match = 0.0
        best_source = None
        for src in sources:
            src_words = set(src.lower().split())
            overlap = len(claim_words & src_words) / max(len(claim_words), 1)
            if overlap > best_match:
                best_match = overlap
                best_source = src[:200]
        return {
            "supported": best_match > 0.3,
            "confidence": best_match,
            "source": best_source,
        }

    def compute_hallucination_rate(self, claims: List[Dict]) -> float:
        """Fraction of verified claims that no source supports."""
        if not claims:
            return 0.0
        unsupported = sum(1 for c in claims if not c["supported"])
        return unsupported / len(claims)
Eval Dataset Manager
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, List


class EvalDatasetManager:
    """Registers content-hashed dataset versions in a JSON registry file."""

    def __init__(self, registry_path: str = "./eval_registry.json"):
        self.registry_path = registry_path

    @staticmethod
    def _case_hash(case: Dict) -> str:
        return hashlib.sha256(json.dumps(case, sort_keys=True).encode()).hexdigest()

    def register_dataset(self, name: str, test_cases: List[Dict]) -> Dict:
        content_hash = hashlib.sha256(
            json.dumps(test_cases, sort_keys=True).encode()
        ).hexdigest()[:16]
        entry = {
            "name": name,
            "version": content_hash,
            "created_at": datetime.now(timezone.utc).isoformat(),
            "num_cases": len(test_cases),
            "difficulty_tiers": self._compute_tiers(test_cases),
            # Per-case hashes are stored so check_contamination can find reused cases.
            "case_hashes": [self._case_hash(c) for c in test_cases],
        }
        self._save_entry(entry)
        return entry

    def _compute_tiers(self, cases: List[Dict]) -> Dict:
        tiers = {"easy": 0, "medium": 0, "hard": 0}
        for case in cases:
            difficulty = case.get("difficulty", "medium")
            if difficulty in tiers:
                tiers[difficulty] += 1
        return tiers

    def _load_registry(self) -> List[Dict]:
        try:
            with open(self.registry_path, "r") as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            return []

    def _save_entry(self, entry: Dict) -> None:
        # Entries are only ever appended; existing versions are never edited in place.
        registry = self._load_registry()
        registry.append(entry)
        with open(self.registry_path, "w") as f:
            json.dump(registry, f, indent=2)

    def check_contamination(self, new_cases: List[Dict]) -> List[Dict]:
        """Return the new cases that exactly duplicate a case in any registered dataset."""
        known = {
            h
            for entry in self._load_registry()
            for h in entry.get("case_hashes", [])
        }
        return [case for case in new_cases if self._case_hash(case) in known]
Architecture Decision Trees
Evaluation Strategy Selection
What type of agent output?
├── Single-turn factual response
│ ├── Ground truth available → Exact Match / F1 / BLEU
│ └── No ground truth → LLM-as-Judge (pointwise) + calibration
│
├── Multi-step tool-calling
│ ├── Trajectory matters → Step-level trajectory evaluation
│ └── Only final outcome → Outcome diff + state comparison
│
├── Code generation
│ ├── Executable → pass@k with unit tests
│ └── Non-executable → LLM-as-Judge (pairwise)
│
├── Summarization
│ ├── Reference available → ROUGE-L + BERTScore
│ └── No reference → LLM-as-Judge + faithfulness check
│
└── Chat/conversation
├── Single response → Dimension-based rubric scoring
└── Full conversation → Trajectory + outcome + coherence
Statistical Test Selection
Comparing agent versions?
├── Paired data (same test cases, two model versions)
│ ├── Normal distribution → Paired t-test
│ └── Non-normal → Wilcoxon signed-rank test
│
├── Unpaired data (different test sets)
│ ├── Normal distribution → Independent t-test
│ └── Non-normal → Mann-Whitney U test
│
├── Multiple dimensions simultaneously
│ └── Holm-Bonferroni correction for alpha
│
└── Small sample (n < 30)
└── Bootstrap confidence intervals (1000 resamples)
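The first two branches of this tree can be written as a lookup, and the paired t statistic itself is a one-liner over per-case differences. The sketch below uses illustrative function names and assumes the normality decision has already been made elsewhere (for example by the Shapiro-Wilk test from Phase 4).

```python
import math
import statistics


def select_test(paired: bool, normal: bool) -> str:
    """Name the comparison test for the given data layout and distribution."""
    if paired:
        return "paired t-test" if normal else "Wilcoxon signed-rank test"
    return "independent t-test" if normal else "Mann-Whitney U test"


def paired_t_statistic(current, baseline):
    """t = mean(d) / (sd(d) / sqrt(n)), with d_i the per-case score difference."""
    if len(current) != len(baseline):
        raise ValueError("paired data needs one score per test case in both runs")
    diffs = [c - b for c, b in zip(current, baseline)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
```

Because the paired form removes per-case difficulty from the noise term, it usually gives a larger statistic than the independent test on the same scores.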
Production Considerations
- Tiered evaluation in CI: Run fast smoke tests (5% of cases) on every PR commit. Run full suite (100% of cases) on merge to main. Use 3-tier pipeline: smoke → regression → full.
- Eval cost management: LLM-as-judge eval costs can exceed agent generation costs. Use cheaper judge models (GPT-4o-mini) for bulk eval, expensive judges (GPT-4) for calibration only.
- Parallel evaluation: Run independent eval cases in parallel batches. Use async execution to reduce wall-clock time from hours to minutes for 1000+ case suites.
- Baseline versioning: Store baseline distributions (mean, std, N), not just point scores, so that regressions can be tested statistically when comparing versions.
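The parallel evaluation point can be sketched with `asyncio` and a semaphore that bounds concurrency. Here `evaluate_case` is a hypothetical placeholder for whatever coroutine actually scores a case, and the default limit of 8 is arbitrary.

```python
import asyncio


async def evaluate_case(case: dict) -> dict:
    """Hypothetical stand-in for a real scoring call (API request, test run)."""
    await asyncio.sleep(0)  # yield control, as a real I/O-bound call would
    return {"id": case["id"], "score": 1.0}


async def run_parallel(cases, max_concurrency: int = 8):
    """Evaluate independent cases concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(case):
        async with semaphore:
            return await evaluate_case(case)

    # gather() returns results in the same order as the input cases.
    return await asyncio.gather(*(bounded(c) for c in cases))
```

A call such as `asyncio.run(run_parallel(cases, max_concurrency=4))` returns one result per case; the bound keeps the run within provider rate limits.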
Security Considerations
- Jailbreak detection in eval: Include adversarial test cases that probe for instruction-following violations. Score safety as a mandatory eval dimension.
- Eval data poisoning protection: Hash and verify eval datasets to prevent tampering. Use checksums stored in a separate integrity registry.
- Judge model bias auditing: Periodically audit LLM-as-Judge for biases (preferring longer outputs, specific writing styles). Re-calibrate against human judgments quarterly.
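Tamper detection for eval datasets can be as simple as recomputing a checksum and comparing it with one stored elsewhere. A minimal sketch follows; the function names are illustrative, and how the expected checksum is fetched from the integrity registry is left open.

```python
import hashlib
import hmac
import json


def dataset_checksum(test_cases) -> str:
    """SHA-256 over a canonical JSON encoding of the test cases."""
    payload = json.dumps(test_cases, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def verify_dataset(test_cases, expected_checksum: str) -> bool:
    """True when the dataset still matches the checksum recorded for it."""
    return hmac.compare_digest(dataset_checksum(test_cases), expected_checksum)
```

Running the check at the start of every eval job means a modified dataset fails fast rather than silently shifting scores.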
Anti-Patterns
| Anti-Pattern | Why It Fails | Correct Approach |
|---|---|---|
| Single-sample evaluation | LLM non-determinism makes results unreproducible | Use N ≥ 5 samples, report confidence intervals |
| Same model for generation and judging | Self-bias inflates scores | Use different model family for judging |
| Static test datasets without refresh | Benchmark saturation over time | Stratify by difficulty, refresh 20% quarterly |
| Ignoring trajectory in multi-step agents | A right answer reached through wrong reasoning is a latent bug | Always evaluate both trajectory and outcome |
| Using default temperature for eval | High temperature adds noise to judged scores | Fix temperature at 0 for metric-based, 0.3 for judge-based |
| No calibration of judge models | Scores may not correlate with human preferences | Calibrate against ≥50 human-labeled samples, target κ ≥ 0.6 |
| Multiple comparisons without correction | Inflated false positive rate in regression detection | Apply Holm-Bonferroni or Benjamini-Hochberg correction |
Performance Optimization
- Batch judge prompts: Combine multiple eval cases into single API calls with structured output schemas; this can reduce API overhead by 40-60%.
- Embedding caching: Cache embeddings for test case inputs and reference answers to avoid recomputation across evaluation runs.
- Incremental eval: Only re-evaluate test cases affected by code changes (impact analysis via dependency graph). This can reduce eval time by 70-90% for targeted changes.
- Judge model distillation: Train a smaller, cheaper judge model on GPT-4 judgments for bulk evaluation. Validate alignment periodically against the full judge model.
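Embedding caching, for instance, needs little more than a dictionary keyed by a hash of the text. In the sketch below `embed_fn` is a placeholder for whatever embedding call the pipeline uses, and the in-memory store is an assumption; a cache shared across runs would need to persist to disk or a key-value service.

```python
import hashlib


class EmbeddingCache:
    """Memoizes embeddings by content hash so repeated texts are embedded once."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # placeholder for the real embedding call
        self._store = {}
        self.misses = 0            # number of times embed_fn was actually invoked

    def get(self, text: str):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

Tracking `misses` makes it easy to confirm that the cache is being hit on a second pass over the same dataset.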