---
name: agent-evaluation
description: Evaluate GenAI agent task execution using LLM-as-judge. Produces structured scores (0-5), feedback, and improvement recommendations.
metadata:
  author: opencode-reflection-plugin
  version: "1.0"
---
Agent Evaluation Skill
Evaluate AI agent task execution using LLM-as-judge patterns drawn from the DeepEval, RAGAS, and G-Eval frameworks.
Output Format
Evaluation results are saved to evals/results/eval-${yyyy-mm-dd-hh-mm}-${commit_id}.md
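The naming pattern above can be produced in a few lines. This is a minimal sketch, assuming local time and a short commit hash supplied by the caller; `resultsPath` is a hypothetical helper, not a function from the plugin.

```typescript
// Hypothetical helper: builds the results path from a timestamp and a commit id.
function resultsPath(date: Date, commitId: string): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  const stamp = [
    date.getFullYear(),
    pad(date.getMonth() + 1), // getMonth() is zero-based
    pad(date.getDate()),
    pad(date.getHours()),
    pad(date.getMinutes()),
  ].join("-");
  return `evals/results/eval-${stamp}-${commitId}.md`;
}
```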
Results Table
| Task Input | Agent Output | Reflection Input | Reflection Output | Score | Verdict | Feedback |
|---|---|---|---|---|---|---|
| Create hello.js... | I've created hello.js with... | Task: Create hello.js Agent Output: ... | Task complete | 5/5 | COMPLETE | Agent produced output; Found completion indicators |
| Fix the bug... | I found the issue and... | Task: Fix bug Agent Output: ... | (none) | 3/5 | PARTIAL | Agent produced output; Missing reflection |
Run Evaluation
# Run E2E evaluation
npx tsx eval.ts
# Or via npm
npm run eval:e2e
# Output saved to: evals/results/eval-2026-01-28-12-30-abc1234.md
Evaluation Rubric (0-5)
| Score | Verdict | Criteria |
|---|---|---|
| 5 | COMPLETE | Task fully accomplished. All requirements met. Optimal execution. |
| 4 | MOSTLY_COMPLETE | Task done with minor issues. 1-2 suboptimal steps. |
| 3 | PARTIAL | Core objective achieved but significant gaps or errors. |
| 2 | ATTEMPTED | Progress made but failed to complete. Correct intent, wrong execution. |
| 1 | FAILED | Wrong approach or incorrect result. |
| 0 | NO_ATTEMPT | No meaningful progress. Crashed or no output. |
Pass threshold: >= 3 (development), >= 4 (production)
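The rubric and thresholds can be encoded directly, which keeps the score-to-verdict mapping in one place. The sketch below is illustrative only; the names `verdictFor`, `passes`, and `PASS_THRESHOLD` are assumptions and do not come from the plugin.

```typescript
type Verdict =
  | "COMPLETE"
  | "MOSTLY_COMPLETE"
  | "PARTIAL"
  | "ATTEMPTED"
  | "FAILED"
  | "NO_ATTEMPT";

// Index i holds the verdict for score i, mirroring the rubric table.
const VERDICTS: readonly Verdict[] = [
  "NO_ATTEMPT",
  "FAILED",
  "ATTEMPTED",
  "PARTIAL",
  "MOSTLY_COMPLETE",
  "COMPLETE",
];

// Thresholds taken from the "Pass threshold" line.
const PASS_THRESHOLD = { development: 3, production: 4 } as const;
type Environment = keyof typeof PASS_THRESHOLD;

function verdictFor(score: number): Verdict {
  if (!Number.isInteger(score) || score < 0 || score > 5) {
    throw new RangeError(`score must be an integer from 0 to 5, got ${score}`);
  }
  return VERDICTS[score];
}

function passes(score: number, env: Environment): boolean {
  return score >= PASS_THRESHOLD[env];
}
```

With these definitions a score of 3 passes in development but not in production.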
Evaluation Prompt Template
Use this prompt for LLM-as-judge evaluation:
You are an expert evaluator assessing AI agent task completion.
## Original Task
{{task}}
## Execution Trace
{{trace}}
## Final Output
{{output}}
## Evaluation Criteria
1. Was the core objective achieved?
2. Were appropriate tools selected?
3. Were tool arguments correct?
4. Was execution efficient (minimal steps)?
5. Is the final output accurate and complete?
## Scoring Rubric
- 5: COMPLETE - All requirements met perfectly
- 4: MOSTLY_COMPLETE - Minor issues only
- 3: PARTIAL - Core done but significant gaps
- 2: ATTEMPTED - Progress made but failed
- 1: FAILED - Wrong approach or result
- 0: NO_ATTEMPT - No meaningful progress
## Instructions
1. Analyze the execution step-by-step
2. Identify specific issues or strengths
3. Score using the rubric
4. Provide actionable recommendations
## Response Format (JSON only)
{
  "reasoning": "<step-by-step analysis>",
  "score": <0-5>,
  "verdict": "<COMPLETE|MOSTLY_COMPLETE|PARTIAL|ATTEMPTED|FAILED|NO_ATTEMPT>",
  "feedback": "<1-2 sentence summary>",
  "recommendations": ["<improvement 1>", "<improvement 2>"]
}
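Because the judge replies with free-form text, it helps to validate the reply against the response format before trusting the score. The following is a minimal sketch of such a check; `parseEvaluation` and `VALID_VERDICTS` are hypothetical names, and the exact validation rules are an assumption based on the format above.

```typescript
interface Evaluation {
  reasoning: string;
  score: number;
  verdict: string;
  feedback: string;
  recommendations: string[];
}

const VALID_VERDICTS: readonly string[] = [
  "COMPLETE",
  "MOSTLY_COMPLETE",
  "PARTIAL",
  "ATTEMPTED",
  "FAILED",
  "NO_ATTEMPT",
];

// Parses the judge's raw reply and rejects anything that deviates from the format.
function parseEvaluation(raw: string): Evaluation {
  const data: unknown = JSON.parse(raw); // throws SyntaxError on malformed JSON
  if (typeof data !== "object" || data === null) {
    throw new Error("evaluation must be a JSON object");
  }
  const { reasoning, score, verdict, feedback, recommendations } =
    data as Record<string, unknown>;

  if (typeof score !== "number" || !Number.isInteger(score) || score < 0 || score > 5) {
    throw new Error("score must be an integer from 0 to 5");
  }
  if (typeof verdict !== "string" || !VALID_VERDICTS.includes(verdict)) {
    throw new Error("verdict is not one of the rubric verdicts");
  }
  if (typeof reasoning !== "string" || typeof feedback !== "string") {
    throw new Error("reasoning and feedback must be strings");
  }
  if (!Array.isArray(recommendations) || !recommendations.every((r) => typeof r === "string")) {
    throw new Error("recommendations must be an array of strings");
  }
  return { reasoning, score, verdict, feedback, recommendations };
}
```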
Quick Evaluation Playbook
Step 1: Extract Data
# Get task from session
TASK=$(cat .reflection/session_*.json | jq -r '.task' | head -1)
# Get execution trace (last 20 messages)
TRACE=$(opencode session messages --limit 20 --format json)
# Get final output
OUTPUT=$(opencode session messages --last --format text)
Step 2: Run Evaluation
# Using promptfoo (recommended)
cd evals && npx promptfoo eval \
-c agent-eval.yaml \
--var task="$TASK" \
--var trace="$TRACE" \
--var output="$OUTPUT" \
-o results/eval-$(date +%s).json
# Or using direct API call
curl -X POST "https://api.openai.com/v1/chat/completions" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"messages": [{"role": "user", "content": "<EVAL_PROMPT>"}],
"response_format": {"type": "json_object"}
}' | jq '.choices[0].message.content | fromjson'
Step 3: Parse Results
# Extract score and feedback
cat results/eval-*.json | jq '{
score: .score,
verdict: .verdict,
feedback: .feedback,
recommendations: .recommendations
}'
Metrics Reference
Core Agent Metrics
| Metric | Type | Description |
|---|---|---|
| Task Completion | 0-5 | Overall goal achievement |
| Tool Correctness | binary | Right tools selected |
| Argument Accuracy | 0-1 | Tool arguments correct |
| Step Efficiency | 0-1 | Minimal steps to goal |
Composite Scores
# Normalize task_completion (0-5) to 0-1 so every term shares one scale
overall_score = (
    (task_completion / 5) * 0.5 +
    tool_correctness * 0.2 +
    argument_accuracy * 0.2 +
    step_efficiency * 0.1
)
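A minimal sketch of the composite score in TypeScript. It assumes every component is placed on a 0-1 scale before weighting (task completion divided by 5, tool correctness as 0 or 1), so the result also falls between 0 and 1; `AgentMetrics` and `overallScore` are hypothetical names used only for illustration.

```typescript
interface AgentMetrics {
  taskCompletion: number;   // 0-5 rubric score
  toolCorrectness: boolean; // binary: were the right tools selected?
  argumentAccuracy: number; // 0-1
  stepEfficiency: number;   // 0-1
}

// Weighted sum using the weights from the formula above (0.5, 0.2, 0.2, 0.1).
function overallScore(m: AgentMetrics): number {
  return (
    (m.taskCompletion / 5) * 0.5 +
    (m.toolCorrectness ? 1 : 0) * 0.2 +
    m.argumentAccuracy * 0.2 +
    m.stepEfficiency * 0.1
  );
}
```

Since the weights sum to 1, a run that is perfect on every metric scores approximately 1 (up to floating-point rounding) and a run that fails every metric scores 0.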
promptfoo Config Example
Create evals/agent-eval.yaml:
description: Agent task completion evaluation
prompts:
  - file://prompts/agent-evaluation.txt
providers:
  - id: azure:gpt-4.1-mini
    config:
      apiHost: eastus.api.cognitive.microsoft.com
      deployment_id: gpt-4.1-mini
defaultTest:
  assert:
    - type: is-json
    - type: javascript
      value: JSON.parse(output).score >= 0 && JSON.parse(output).score <= 5
tests:
  - vars:
      task: "Create a hello.js file that prints Hello World"
      trace: |
        1. Agent reads current directory
        2. Agent creates hello.js with console.log("Hello World")
        3. Agent confirms file created
      output: "Created hello.js with console.log('Hello World')"
    assert:
      - type: javascript
        value: JSON.parse(output).score >= 4
Integration with Reflection Plugin
The reflection plugin uses this evaluation pattern internally:
// reflection-3.ts - simplified evaluation flow
async function evaluateTask(sessionId: string): Promise<Evaluation> {
  // getSessionMessages is a placeholder for however the plugin loads the session's messages
  const messages = await getSessionMessages(sessionId)
  const task = extractInitialTask(messages)
  const trace = formatExecutionTrace(messages)
  const output = extractFinalOutput(messages)
  const response = await llm.chat({
    messages: [{ role: "user", content: buildEvalPrompt(task, trace, output) }],
    response_format: { type: "json_object" }
  })
  return JSON.parse(response.content)
}
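The source does not show how `buildEvalPrompt` works. One plausible shape, sketched here on the assumption that it substitutes the `{{task}}`, `{{trace}}`, and `{{output}}` placeholders of the prompt template; `fillTemplate` is a hypothetical name and the plugin's real implementation may differ.

```typescript
// Hypothetical template filler: replaces {{name}} placeholders with values from vars.
function fillTemplate(template: string, vars: Record<string, string>): string {
  // A callback is used so that "$" sequences in values are inserted literally.
  return template.replace(/\{\{(\w+)\}\}/g, (match: string, name: string) =>
    // Unknown placeholders are left untouched so missing variables are easy to spot.
    Object.prototype.hasOwnProperty.call(vars, name) ? vars[name] : match
  );
}

const prompt = fillTemplate("## Original Task\n{{task}}\n## Final Output\n{{output}}", {
  task: "Create hello.js",
  output: "Created hello.js",
});
```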
Benchmarks
Standard Test Cases
| # | Task | Expected Score | Notes |
|---|---|---|---|
| 1 | Create file | 5 | Simple, single tool |
| 2 | Multi-file refactor | 4+ | Multiple edits |
| 3 | Debug test failure | 3+ | Iterative process |
| 4 | Research question | 4+ | Read-only, synthesis |
Running Benchmarks
# Run full benchmark suite
npm run eval
# Run specific benchmark
npm run eval:judge
# View results
npm run eval:view
Best Practices
- Always include reasoning - Makes debugging possible
- Use structured JSON output - Parse reliably
- Score consistently - Same rubric across all evals
- Track over time - Catch regressions
- Calibrate with humans - Validate judge accuracy periodically
- Separate outcome vs process - Score both what and how
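For the calibration practice above, one simple check is to score the same runs with both the judge and a human and compare the two. The sketch below computes the exact-agreement rate and the mean absolute error; these two statistics and the names `ScorePair` and `calibration` are choices made for illustration, not metrics prescribed by the skill.

```typescript
interface ScorePair {
  judge: number; // 0-5 score from the LLM judge
  human: number; // 0-5 score from a human reviewer for the same run
}

function calibration(pairs: ScorePair[]): { exactAgreement: number; meanAbsError: number } {
  if (pairs.length === 0) {
    throw new Error("need at least one scored pair");
  }
  let exact = 0;
  let absError = 0;
  for (const { judge, human } of pairs) {
    if (judge === human) exact += 1;
    absError += Math.abs(judge - human);
  }
  return {
    exactAgreement: exact / pairs.length,
    meanAbsError: absError / pairs.length,
  };
}
```

A falling agreement rate or a rising mean absolute error over time suggests the judge prompt or rubric needs another look.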
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Score always 5 | Prompt too lenient | Add explicit failure criteria |
| Score always low | Rubric too strict | Calibrate with human evals |
| JSON parse error | LLM not following format | Add response_format constraint |
| Inconsistent scores | Ambiguous criteria | Make rubric more specific |
| Slow evaluation | Large trace | Truncate to last N messages |
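
The last row's fix, truncating the trace to the last N messages, can be sketched as follows. The `TraceMessage` shape and the `truncateTrace` name are assumptions; the default of 20 mirrors the limit used in the playbook's extraction step.

```typescript
interface TraceMessage {
  role: string;
  content: string;
}

// Keeps only the most recent lastN messages; shorter traces are returned whole.
function truncateTrace(messages: TraceMessage[], lastN = 20): TraceMessage[] {
  if (!Number.isInteger(lastN) || lastN < 1) {
    // Guard needed because slice(-0) would return the entire array.
    throw new RangeError("lastN must be a positive integer");
  }
  return messages.slice(-lastN);
}
```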