name: 06-evaluation
description: >
Use when evaluating an agent's response quality and safety before deployment.
Covers running agent-evaluate, evaluation dataset format, built-in judges
(relevance, groundedness, safety), interpreting results, and customizing
eval datasets. Track A Step 6. Consumes a working agent with tools from
Steps 1-5. Produces evaluation results and confidence to deploy.
license: Apache-2.0
clients: [ide_cli, genie_code]
bundle_resource: none
deploy_verb: none
deploy_note: "Agent evaluation (agent-evaluate, built-in judges) runs via the MLflow SDK on serverless workspace compute; no bundle resource. Identical on both clients. See skills/genie-code-environment."
coverage: full
metadata:
last_verified: "2026-06-05"
volatility: high
upstream_sources: []
author: "prashanth-subrahmanyam"
version: "1.0.0"
domain: "genai-agents"
pipeline_position: "A6"
consumes: "agent_with_tools, authenticated_client, predict_fn"
produces: "evaluation_results, eval_dataset"
grounded_in: "docs.databricks.com/aws/en/generative-ai/agent-framework/author-agent#step-6-evaluate-the-agent, docs.databricks.com/aws/en/generative-ai/agent-evaluation, mlflow.org/docs/latest/genai/evaluation"
fields_read:
- agent.benchmark_seeds.seed_examples
- governance.verification.smoke_test_cases
- docs.agent_tool_plan.selected_tools
- docs.agent_tool_plan.verification.tool_smoke_tests
inputs:
- name: agent_tool_plan_ref
  required: false
  description: >
    Path to docs/agent_tool_plan.yaml. When set, the smoke set is the UNION of governance.verification.smoke_test_cases[] (Spec) and verification.tool_smoke_tests[] (Plan, one per selected_tools[] entry). The fail-closed gate's "any tool returns empty output" condition is scoped to selected_tools[], so tools that were never wired cannot trip it.
Track A Step 6: Evaluate the Agent
Verify your agent produces relevant, grounded, and safe responses before deploying it. The template includes evaluation infrastructure — this step teaches you to run it, interpret results, and customize the dataset.
When to Use
- Your agent is running locally with tools, auth, and (optionally) memory configured from Steps 1-5.
- You want to validate response quality before deploying to Databricks Apps.
- You need to establish a baseline for ongoing monitoring after deployment.
Relationship to SDLC Evaluation
This step runs the template's built-in evaluation as a smoke test. The full SDLC evaluation pipeline (S2–S4) adds custom benchmark datasets, domain-specific scorers with threshold gates, and human feedback labeling. Think of A6 as "does my agent work?" and S4 as "does my agent meet production quality bars?"
| Aspect | A6 (this step) | S4 (SDLC) |
|---|---|---|
| Scope | Template smoke test | Comprehensive quality gate |
| Dataset | tests/eval_dataset.json (5-10 cases) | Custom benchmark (50-500+ cases) |
| Scorers | Built-in judges only | Custom + built-in judges with thresholds |
| Trigger | Manual uv run agent-evaluate | CI/CD pipeline on every PR |
| Output | Pass/fail summary | Detailed metrics, regression detection |
Step 6a: Run the Built-In Evaluation
Smoke evaluation uses the model route configured under runtime_config.llm in docs/agent_tool_plan.yaml. The evaluation runner must not substitute a hardcoded model endpoint. If the route is the default Databricks provider, it resolves to llm_endpoint in config.yml.
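A minimal sketch of that resolution rule, assuming the plan and config have already been parsed into dictionaries. The `provider` and `endpoint` keys and the `"databricks"` label are hypothetical; only `runtime_config.llm` and `llm_endpoint` come from the text above.

```python
from typing import Any, Mapping

DEFAULT_PROVIDER = "databricks"  # hypothetical label for the default provider


def resolve_llm_endpoint(plan: Mapping[str, Any], config: Mapping[str, Any]) -> str:
    """Return the endpoint smoke evaluation should use, based on the configured route."""
    route = plan.get("runtime_config", {}).get("llm", {})
    provider = route.get("provider", DEFAULT_PROVIDER)  # hypothetical key
    if provider == DEFAULT_PROVIDER:
        # Default Databricks route: defer to llm_endpoint in config.yml.
        endpoint = config.get("llm_endpoint")
    else:
        endpoint = route.get("endpoint")  # hypothetical key
    if not endpoint:
        # Fail loudly rather than substituting a hardcoded endpoint.
        raise ValueError("No LLM endpoint configured for the evaluation route")
    return endpoint
```

Raising when nothing is configured keeps the runner from silently falling back to a hardcoded endpoint.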
The agent template includes evaluation code in agent_server/evaluate_agent.py
(or tests/evaluate_agent.py depending on your template version). Run it:
uv run agent-evaluate
What this does:
- Loads the evaluation dataset (typically tests/eval_dataset.json)
- Sends each test case to your agent via predict()
- Runs built-in judges against the responses
- Outputs a summary report with pass/fail and scores
Expected output:
Running evaluation...
✓ relevance: 4/5 passed
✓ groundedness: 5/5 passed
✓ safety: 5/5 passed
Evaluation complete. Results saved to MLflow experiment.
If the command fails with auth errors, see Authentication in Evaluation Context below.
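The four steps listed under "What this does" can be sketched as a plain loop. This is not the template's evaluate_agent.py; the `echo_agent` and `non_empty` functions are stand-ins invented for illustration, and real judges are LLM-based rather than simple predicates.

```python
from typing import Callable

# A judge takes (dataset row, agent response) and returns True when the row passes.
Judge = Callable[[dict, str], bool]


def run_smoke_eval(rows: list[dict], predict: Callable[[dict], str],
                   judges: dict[str, Judge]) -> dict[str, str]:
    """Send each row to the agent, judge the response, and summarize pass counts."""
    passed = {name: 0 for name in judges}
    for row in rows:
        response = predict(row["request"])
        for name, judge in judges.items():
            passed[name] += int(judge(row, response))
    return {name: f"{count}/{len(rows)} passed" for name, count in passed.items()}


# Stand-ins for illustration only: an agent that echoes the last user message
# and a judge that only checks the response is not blank.
def echo_agent(request: dict) -> str:
    return request["input"][-1]["content"]


def non_empty(row: dict, response: str) -> bool:
    return bool(response.strip())


rows = [{"request": {"input": [{"role": "user", "content": "What is MLflow?"}]}}]
summary = run_smoke_eval(rows, echo_agent, {"non_empty": non_empty})
print(summary)  # {'non_empty': '1/1 passed'}
```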
Step 6b: Understand the Evaluation Dataset
The template ships a sample dataset in tests/eval_dataset.json:
[
{
"request": {
"input": [
{"role": "user", "content": "What is MLflow?"}
]
},
"expected_response": "MLflow is an open source platform for managing the ML lifecycle.",
"expected_retrieved_context": [
{"doc_uri": "https://mlflow.org/docs/latest/index.html"}
]
}
]
| Field | Required | Purpose |
|---|---|---|
| request | Yes | The input sent to the agent (input messages, optional custom_inputs) |
| expected_response | No | Ground truth for relevance comparison |
| expected_retrieved_context | No | Expected doc_uri values for retrieval judges |
The expected_retrieved_context field connects to set_retriever_schema() from
Step 3 — the doc_uri column is how judges match retrieved vs. expected documents.
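Before running an evaluation it can help to check each row against the field table above. The following is a minimal sketch of such a check; `validate_eval_row` is a hypothetical helper, not part of the template or of MLflow, and it only enforces the rules stated in this section.

```python
def validate_eval_row(row: dict) -> list[str]:
    """Return the problems found in one dataset row (empty list if none)."""
    request = row.get("request")
    if not isinstance(request, dict):
        return ["request is required and must be an object"]

    problems = []
    messages = request.get("input")
    if not isinstance(messages, list) or not messages:
        problems.append("request.input must be a non-empty list of messages")
    else:
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict) or not {"role", "content"} <= msg.keys():
                problems.append(f"request.input[{i}] needs 'role' and 'content'")

    # expected_response is optional and may be null.
    expected = row.get("expected_response")
    if expected is not None and not isinstance(expected, str):
        problems.append("expected_response must be a string or null")

    # Each expected context entry must carry the doc_uri used for matching.
    for i, ctx in enumerate(row.get("expected_retrieved_context") or []):
        if not isinstance(ctx, dict) or "doc_uri" not in ctx:
            problems.append(f"expected_retrieved_context[{i}] needs 'doc_uri'")
    return problems
```

Running this over every row and failing fast on a non-empty result catches malformed cases before any judge is invoked.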
Step 6c: Understand the Built-In Judges
Agent Evaluation runs multiple judges automatically:
| Judge | What It Checks | Requires |
|---|---|---|
| Relevance | Is the response relevant to the question? | expected_response (optional — can judge without) |
| Groundedness | Is the response grounded in retrieved context? | RETRIEVER spans in traces |
| Safety | Does the response avoid harmful content? | Nothing extra |
| Chunk relevance | Are retrieved chunks relevant to the query? | RETRIEVER spans |
Judges are LLM-based — they use a Databricks model serving endpoint to evaluate quality. The evaluation harness handles this automatically.
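The "Requires" column can be read as a lookup from judge to the span types it needs. The sketch below encodes only that table; the lowercase judge identifiers are illustrative, and how the evaluation harness actually decides which judges to run is not specified here.

```python
# Span types each judge needs, taken from the "Requires" column above.
JUDGE_REQUIREMENTS = {
    "relevance": set(),
    "groundedness": {"RETRIEVER"},
    "safety": set(),
    "chunk_relevance": {"RETRIEVER"},
}


def applicable_judges(span_types) -> list[str]:
    """List the judges whose required span types all appear in a trace."""
    present = set(span_types)
    return [name for name, needed in JUDGE_REQUIREMENTS.items() if needed <= present]


# A trace with no RETRIEVER spans leaves only the judges that need nothing extra.
print(applicable_judges(["LLM"]))  # ['relevance', 'safety']
```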
Step 6d: Interpret Results
After running evaluation, results appear in two places:
1. Terminal output
Summary pass/fail counts and aggregate scores.
2. MLflow Experiment UI
Navigate to your experiment → Evaluation tab for:
- Per-request scores and judge rationale
- Trace visualization for each test case
- Aggregate metrics across the dataset
Look for:
- Low relevance scores → Agent may need better instructions or tools
- Low groundedness → Agent may be hallucinating beyond retrieved context
- Safety failures → Agent needs guardrails or instruction updates
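The triage rules above can be applied mechanically to per-row verdicts. This sketch assumes each row has been reduced to a mapping of judge name to True, False, or None (judge did not run); that shape is an assumption for illustration, not the format of the MLflow results table. The 80% threshold and the zero-tolerance rule for safety follow the validation gate in this step.

```python
HINTS = {
    "relevance": "Agent may need better instructions or tools",
    "groundedness": "Agent may be hallucinating beyond retrieved context",
    "safety": "Agent needs guardrails or instruction updates",
}


def triage(rows: list[dict], threshold: float = 0.8) -> dict[str, str]:
    """Map each underperforming judge to its pass rate and a suggested next step."""
    findings = {}
    for judge, hint in HINTS.items():
        # Skip rows where the judge did not run (verdict missing or None).
        verdicts = [row[judge] for row in rows if row.get(judge) is not None]
        if not verdicts:
            continue
        rate = sum(verdicts) / len(verdicts)
        # Any safety failure is flagged; other judges use the threshold.
        limit = 1.0 if judge == "safety" else threshold
        if rate < limit:
            findings[judge] = f"{rate:.0%} passed: {hint}"
    return findings
```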
Step 6e: Customize the Evaluation Dataset
Replace the sample dataset with test cases for your agent's domain:
[
{
"request": {
"input": [
{"role": "user", "content": "What were Q3 2025 total sales?"}
]
},
"expected_response": "Q3 2025 total sales were $42.3M.",
"expected_retrieved_context": [
{"doc_uri": "catalog.schema.sales_summary"}
]
},
{
"request": {
"input": [
{"role": "user", "content": "Who is the VP of Engineering?"}
]
},
"expected_response": null
}
]
Guidelines for good eval datasets:
- Include 10-50 test cases covering your agent's core capabilities
- Mix questions that require tools with general knowledge questions
- Include edge cases (ambiguous queries, out-of-scope questions)
- Set expected_response to null when you only want safety/groundedness checks (no relevance baseline)
- Use real doc_uri values from your Vector Search indexes
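If hand-writing JSON becomes error-prone as the dataset grows, the cases can be generated from a small helper. This is a sketch: `make_case` is a hypothetical function, and the two cases simply reproduce the example dataset shown earlier in this step.

```python
import json


def make_case(question: str, expected_response=None, doc_uris=()) -> dict:
    """Build one evaluation row in the dataset format used by this step."""
    case = {
        "request": {"input": [{"role": "user", "content": question}]},
        # None serializes to JSON null, meaning no relevance baseline.
        "expected_response": expected_response,
    }
    if doc_uris:
        case["expected_retrieved_context"] = [{"doc_uri": uri} for uri in doc_uris]
    return case


dataset = [
    make_case(
        "What were Q3 2025 total sales?",
        "Q3 2025 total sales were $42.3M.",
        ["catalog.schema.sales_summary"],
    ),
    make_case("Who is the VP of Engineering?"),
]
print(json.dumps(dataset, indent=2))
```

Writing the result to tests/eval_dataset.json keeps the generated file reviewable in version control.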
Step 6f: Run Evaluation Programmatically
For more control, call mlflow.evaluate() directly:
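import json

import mlflow

# Load the evaluation dataset; the context manager closes the file handle.
with open("tests/eval_dataset.json") as f:
    eval_dataset = json.load(f)

with mlflow.start_run():
    results = mlflow.evaluate(
        model="agent.py",
        data=eval_dataset,
        model_type="databricks-agent",
    )

# Aggregate scores, then per-row results with judge rationale.
print(f"Metrics: {results.metrics}")
print(f"Results table:\n{results.tables['eval_results']}")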
This gives you access to:
- results.metrics: aggregate scores (relevance, groundedness, safety)
- results.tables["eval_results"]: per-row results with judge rationale
Authentication in Evaluation Context
Evaluation runs use different auth than Model Serving or Databricks Apps. Key differences:
| Context | Auth Method |
|---|---|
| Model Serving | OBO (user token passthrough) |
| Databricks Apps | SP (app service principal) |
| Evaluation | Your credentials (notebook/CLI user) or SystemAuthPolicy SP |
If your agent uses OBO (IS_IN_DB_MODEL_SERVING_ENV), the OBO environment variables
are absent during evaluation, so an agent that assumes they exist will fail.
The environment detection from Step 4 handles this: it falls back to default
WorkspaceClient() auth outside Model Serving.
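The fallback described above amounts to a single environment check. The sketch below is an illustration, not the Step 4 implementation: `select_auth_mode` is a hypothetical helper, and treating the string "true" as the enabled value of IS_IN_DB_MODEL_SERVING_ENV is an assumption.

```python
import os


def select_auth_mode(environ=None) -> str:
    """Return "obo" inside Model Serving, otherwise "default"."""
    env = os.environ if environ is None else environ
    # Assumption: the flag is set to "true" inside Model Serving.
    in_serving = env.get("IS_IN_DB_MODEL_SERVING_ENV", "").lower() == "true"
    return "obo" if in_serving else "default"


# During evaluation the flag is absent, so the agent should take the default path,
# which the text above describes as plain WorkspaceClient() auth.
print(select_auth_mode({}))  # default
```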
If evaluation still fails with permission errors:
- Verify your agent declares resources in SystemAuthPolicy (see Step 4, references/model-serving-auth-passthrough.md)
- Ensure the evaluation user has access to all resources the agent needs
- Check that Genie Spaces and Vector Search indexes are granted to the system SP
DO / DON'T
DO — Run evaluation before every deployment
uv run agent-evaluate
# Review results, then deploy
DON'T — Deploy without evaluating
Changes to instructions, tools, or models can cause regressions that are invisible without evaluation.
DO — Keep the eval dataset in version control
tests/
├── eval_dataset.json # Your test cases
└── eval_dataset_sample.json # Template's original (for reference)
DON'T — Use only the template's sample dataset
The sample covers generic cases. Replace it with your domain-specific test cases for meaningful quality assurance.
DO — Treat low scores as blockers
# If relevance < 80% or safety has failures:
# 1. Review failing test cases in MLflow UI
# 2. Fix agent instructions or tools
# 3. Re-run evaluation
Common Issues
| Issue | Fix |
|---|---|
| uv run agent-evaluate not found | Check pyproject.toml for the agent-evaluate script entry |
| Auth errors during evaluation | Agent's env detection should fall back to default auth outside serving |
| All relevance scores are low | Check expected_response values — they may not match your agent's style |
| Groundedness judge returns N/A | Agent has no RETRIEVER spans — add set_retriever_schema() (Step 3) |
| Safety judge flags benign content | Review the flagged responses; may need instruction tuning |
Validation Gate
All must pass before proceeding:
- uv run agent-evaluate runs to completion without errors
- Evaluation results visible in MLflow Experiment UI
- Relevance scores meet your threshold (recommended: ≥80%)
- Zero safety failures
- Evaluation dataset customized with domain-specific test cases
- Auth works in evaluation context (no permission errors)
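The two numeric conditions in this gate can be enforced in a script so that a failing run returns a non-zero exit code. This is a sketch under stated assumptions: the inputs are a relevance pass rate between 0 and 1 and a count of safety failures that you compute from your own results, and the function names are hypothetical.

```python
def passes_gate(relevance_pass_rate: float, safety_failures: int,
                min_relevance: float = 0.80) -> bool:
    """Apply the numeric gate: relevance at or above threshold and zero safety failures."""
    return relevance_pass_rate >= min_relevance and safety_failures == 0


def gate_exit_code(relevance_pass_rate: float, safety_failures: int) -> int:
    """Return 0 when the gate passes and 1 otherwise, for use with sys.exit()."""
    if passes_gate(relevance_pass_rate, safety_failures):
        print("Validation gate passed")
        return 0
    print(f"Validation gate failed: relevance={relevance_pass_rate:.0%}, "
          f"safety failures={safety_failures}")
    return 1
```

The remaining gate items (results visible in the UI, dataset customized, no permission errors) still need a manual check.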
Next Step
After passing this gate, load and execute
tracks/A-custom-agent-apps/07-deploy-and-query/SKILL.md (A7: Deploy and Query)
for the first manual deployment.
Going Further: Full SDLC Evaluation
The smoke test above validates basic functionality. After completing Track A (A7), you will enter the SDLC pipeline to build the comprehensive evaluation with custom scorers and threshold gates. When you reach that point, load these skills in order:
- Load and execute sdlc/02-evaluation-datasets/SKILL.md (S2): build your benchmark dataset (50-500+ rows)
- Load and execute sdlc/03-scorers-and-judges/SKILL.md (S3): create domain-specific scorers with pass/fail thresholds
- Load and execute sdlc/04-evaluation-runs/SKILL.md (S4): run mlflow.genai.evaluate() with your predict_fn from A5
S4 uses the same predict_fn you produced in A5 — no additional wrapping
needed. The SDLC evaluation replaces this smoke test as the production quality
gate.
Notes to Carry Forward
| Key | Value |
|---|---|
| eval_dataset | Path to your evaluation dataset |
| eval_results | Baseline metrics (relevance, groundedness, safety scores) |
| mlflow_experiment | MLflow experiment name/ID with evaluation runs |
Related Skills
| Skill | Relationship |
|---|---|
| A5: Lakebase Memory | Previous step — produces predict_fn |
| A7: Deploy and Query | Next step |
| S2: Evaluation Datasets | SDLC: custom benchmark datasets |
| S3: Scorers and Judges | SDLC: custom scorers with threshold gates |
| S4: Evaluation Runs | SDLC: comprehensive evaluation with mlflow.genai.evaluate() |
References
- Author an agent — Evaluate the agent
- Agent Evaluation
- Evaluation sets
- MLflow evaluate()
- Agent Evaluation judges
Version History
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2026-04-11 | Initial skill: agent-evaluate, eval datasets, judges, result interpretation, auth context |