name: 02-evaluation-datasets
description: >
Use when you need to create, manage, or load evaluation datasets for
testing agent quality. Covers the MLflow GenAI data format, persisting
benchmarks in Unity Catalog, merging records without duplicates, and
validating data before evaluation — even if you just want "give me a
dataset I can pass to mlflow.genai.evaluate()." Also use when building
benchmarks from production traces or SME labels. SDLC Step 2.
license: Apache-2.0
compatibility: "Requires Databricks workspace with MLflow 3.10+ and Unity Catalog. Scripts use uv."
clients: [ide_cli, genie_code]
bundle_resource: none
deploy_verb: none
deploy_note: "Evaluation datasets persisted in UC via the MLflow GenAI SDK; no bundle resource. Identical on both clients; on Genie Code run any CLI step through runDatabricksCli. See skills/genie-code-environment."
coverage: full
metadata:
last_verified: "2026-06-05"
volatility: high
upstream_sources: []
author: "prashanth-subrahmanyam"
version: "4.1.0"
domain: "genai-agents"
pipeline_position: "S2"
consumes: "registered_prompts"
produces: "evaluation_dataset, mlflow_dataset_entity"
grounded_in: "https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/build-eval-dataset, https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/concepts/eval-harness, https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/concepts/evaluation-runs"
upstream_sources:
- name: "ai-dev-kit"
repo: "databricks-solutions/ai-dev-kit"
paths:
- "databricks-skills/databricks-mlflow-evaluation/SKILL.md"
relationship: "extended"
last_synced: "2026-04-27"
sync_commit: "281d9acd92d936bd5294f78bd7ec68fb12d4a696"
fields_read:
- ui.user_journeys
- agent.benchmark_seeds.coverage_buckets
- agent.benchmark_seeds.seed_examples
- docs.agent_tool_plan.selected_tools
- docs.agent_tool_plan.verification.tool_smoke_tests
inputs:
  - name: agent_tool_plan_ref
    required: false
    description: >
      Path to docs/agent_tool_plan.yaml. When set, the skill APPENDS one tool-shaped benchmark row per verification.tool_smoke_tests[] entry on top of the generic Spec rows. Tool families absent from selected_tools[] contribute zero appended rows.
Evaluation Dataset Creation
Patterns for building, validating, versioning, and logging evaluation datasets for GenAI agents. Complements SDLC Step 4 (evaluation runs and scorers) by focusing on data shape, UC persistence, and dataset lifecycle, grounded in Databricks MLflow GenAI eval monitor documentation.
Upstream Lineage
This skill extends AI-Dev-Kit's databricks-mlflow-evaluation skill for evaluation dataset construction, production trace-to-dataset workflows, and expectation schema guidance. If local dataset guidance is insufficient or MLflow evaluation dataset APIs drift, consult the upstream skill first, then adapt its patterns to this workshop's canonical row fields and SDLC artifact contracts.
When to Use
- Defining benchmark suites for agent quality gates
- Persisting rows to Unity Catalog Delta tables behind mlflow.genai.datasets
- Loading and merging records without schema surprises across jobs or tasks
- Validating rows before mlflow.genai.evaluate() so bad data never skews metrics
- Balancing splits (train/held_out) and provenance (curated, synthetic, human_labeled)
Official data format (mlflow.genai.evaluate)
Per the eval harness parameters, each record aligns with EvaluationDataset schema. Either inputs + outputs or trace is required; you cannot pass both.
| Field | Data type | Description | Direct evaluation (predict_fn) | Answer sheet |
|---|---|---|---|---|
| inputs | dict[Any, Any] | Passed to predict_fn as **kwargs; keys must match parameter names; JSON-serializable | Required | From trace if omitted |
| outputs | dict[Any, Any] | App outputs for that input; JSON-serializable | Omit (MLflow builds from trace) | Required with inputs |
| expectations | dict[str, Any] | Ground truth for scorers; keys are str; JSON-serializable | Optional | Optional |
| trace | mlflow.entities.Trace | Full trace for the request | Omit (MLflow generates) | Required instead of inputs+outputs |
Direct evaluation: rows typically have inputs and optional expectations only.
Answer sheet: rows have inputs + outputs, or a single trace, plus optional expectations.
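The one-mode-per-row rule above can be checked before any row reaches the harness. The following is a minimal sketch, assuming rows are plain dicts; the helper name row_mode_errors is hypothetical and is not part of MLflow.

```python
from typing import Any


def row_mode_errors(row: dict[str, Any]) -> list[str]:
    """Return the problems found in one evaluation row (empty list means OK).

    Hypothetical helper: it only encodes the rules stated in this section,
    not MLflow's own validation.
    """
    errors = []
    has_inputs = row.get("inputs") is not None
    has_outputs = row.get("outputs") is not None
    has_trace = row.get("trace") is not None

    # A trace replaces inputs/outputs; the two modes must not be combined.
    if has_trace and (has_inputs or has_outputs):
        errors.append("row mixes trace with inputs/outputs; use one mode per row")
    # Without a trace, inputs are the minimum a row needs.
    if not has_trace and not has_inputs:
        errors.append("row needs either inputs or a trace")
    # Outputs only make sense next to the inputs that produced them.
    if has_outputs and not has_inputs:
        errors.append("outputs given without inputs")
    return errors
```

Running this over every row and refusing to evaluate when any list is non-empty keeps mode errors out of scored runs.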
EvaluationDataset vs DataFrame / list-of-dicts
Databricks recommends mlflow.genai.datasets.EvaluationDataset when available: it enforces schema validation and improves lineage tracking. Raw pandas DataFrame, list of dicts, or Spark DataFrame are accepted if they match the same column semantics. See Building MLflow evaluation datasets.
Evaluation record schema (transferable)
mlflow.genai.evaluate() unpacks each row's inputs dict as keyword arguments to predict_fn. Every key in inputs must match the predictor's signature (use **kwargs only for optional or closure-fed values).
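Because inputs is unpacked as keyword arguments, a key mismatch can be detected from the predictor's signature before evaluation starts. This is a sketch using the standard-library inspect module; inputs_key_mismatches and the example predict_fn are hypothetical names chosen for illustration.

```python
import inspect
from typing import Any, Callable


def inputs_key_mismatches(
    predict_fn: Callable[..., Any], inputs: dict[str, Any]
) -> dict[str, list[str]]:
    """Compare the keys of one inputs dict with predict_fn's parameters."""
    params = inspect.signature(predict_fn).parameters
    # A **kwargs parameter means any extra key is accepted.
    accepts_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    named = {
        name
        for name, p in params.items()
        if p.kind
        in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
    }
    # Parameters without a default must be supplied by the row.
    required = {n for n in named if params[n].default is inspect.Parameter.empty}

    unexpected = [] if accepts_kwargs else sorted(set(inputs) - named)
    missing = sorted(required - set(inputs))
    return {"unexpected": unexpected, "missing": missing}


def predict_fn(question: str, top_k: int = 3) -> dict:
    """Hypothetical predictor used only to demonstrate the check."""
    return {"response": f"answer to {question}"}
```

A row whose inputs use query instead of question would be reported as both an unexpected key and a missing parameter.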
Minimal shape (docs and quick tests)
# Chat / generic agent
{"inputs": {"question": "What is X?"}, "expectations": {"reference_answer": "..."}}
# RAG agent — pass what your app and scorers need
{
    "inputs": {"query": "...", "retrieved_context": "..."},
    "expectations": {"citations": ["doc:page"], "answer": "..."},
}
# SQL / tool agent
{
    "inputs": {"question": "Total cost by region?"},
    "expectations": {"expected_sql": "SELECT ...", "expected_tables": ["db.schema.t"]},
}
Production-oriented shapes (examples)
Use only fields your predict_fn and scorers actually read. Typical additions:
| Agent type | inputs (examples) | expectations (examples) |
|---|---|---|
| Chat | messages, session_id, user_id | reference_answer, safety labels |
| RAG | query, document_ids, max_chunks | answer, citation_spans, must_cite |
| SQL / analytics | question, catalog, schema, constraints | expected_sql or expected_result_hash, lineage (split, provenance, validation_status) |
If scorers read ground truth from expectations, mirror any critical label from inputs into expectations when your judges expect a fixed key (for example SQL string under both inputs["expected_sql"] and expectations["expected_response"]).
Canonical evaluation-dataset fields (normative)
Every row in a benchmark dataset MUST carry the following canonical fields, regardless of agent shape. These names are the contract between dataset producers, runners, and scorers — do not invent synonyms.
eval_dataset_required_fields:
- row_id # stable per-row identifier (used as merge key, dedup key, regression-tracking key)
- request # the user-facing input (mirrored under inputs.<key> for predict_fn)
- expected_response # canonical ground truth consumed by the Correctness scorer
- expected_signal # SECONDARY classification field (e.g. intent, severity, topic) — never silently mirrored into expected_response
- bucket # coverage bucket (e.g. "aggregation", "ranking", "edge_case", "permission_denied")
- journey_id # which user journey this row exercises (links to ui.user_journeys)
- split # train | held_out | regression | gold
- provenance # curated | synthetic | auto_corrected | issue_failing_trace | labeling_session_merge
Hard rules:
- expected_response is the ONLY field the Correctness scorer reads as ground truth. Populate it directly with the answer string (or canonical SQL) the agent should produce.
- expected_signal is allowed only as a secondary classification field (e.g. expected_signal = "policy_refusal" for a guardrail label). The runner MUST NOT silently mirror expected_signal into expected_response. If a scorer needs the signal as ground truth, write it under expected_response explicitly with intent.
- row_id is required for merge_records upserts and for tagging regressions across runs. Use a stable hash of (request, journey_id) if you do not have a natural key.
- bucket, journey_id, and split together support coverage gates (see below).
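One way to derive a stable row_id from (request, journey_id) is sketched below. The whitespace and case normalization, the separator character, and the 16-character truncation are all assumptions made for illustration; the source only asks for a stable hash, and make_row_id is a hypothetical name.

```python
import hashlib


def make_row_id(request: str, journey_id: str) -> str:
    """Derive a deterministic row_id from (request, journey_id).

    Assumption: collapsing whitespace and lowercasing the request is
    acceptable, so trivial edits do not produce a new identifier.
    """
    normalized_request = " ".join(request.split()).lower()
    # The unit-separator character keeps the two parts from running together.
    key = f"{journey_id}\x1f{normalized_request}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Because the ID depends only on the row's content, re-running a generator over the same request and journey yields the same merge key.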
Coverage gates (normative thresholds)
Datasets that do not meet the following minima MUST fail pre-evaluation validation:
min_rows: 40
per_bucket_min_rows: 1
per_journey_min_rows: 1
expectations_schema_complete: true # all required fields present + non-empty for split != "regression"
eval_dataset_canonical_source: enum # uc_table | local_json | labeling_session_merge
expectations_schema_complete means: for every row where split != "regression", all canonical fields above are present and expected_response is non-empty. Regression rows may carry expected_response = null only when the row is gated on a scorer threshold (see references/benchmark-generation.md §11).
eval_dataset_canonical_source records where the row set came from. The runner reads this once at startup and refuses to evaluate if the value is unknown — this prevents silent mixing of incompatible sources.
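The gates above can be expressed as a single pre-evaluation check. This is a sketch under two assumptions: rows are flat dicts keyed by the canonical field names, and regression rows are skipped entirely in the completeness check (a simplification of the rule above). The function name coverage_gate_failures is hypothetical.

```python
from collections import Counter
from typing import Any, Iterable

CANONICAL_FIELDS = (
    "row_id", "request", "expected_response", "expected_signal",
    "bucket", "journey_id", "split", "provenance",
)
KNOWN_SOURCES = frozenset({"uc_table", "local_json", "labeling_session_merge"})


def coverage_gate_failures(
    rows: list[dict[str, Any]],
    declared_buckets: Iterable[str],
    declared_journeys: Iterable[str],
    canonical_source: str,
    min_rows: int = 40,
) -> list[str]:
    """Return human-readable gate failures; an empty list means all gates pass."""
    failures: list[str] = []
    if canonical_source not in KNOWN_SOURCES:
        failures.append(f"unknown eval_dataset_canonical_source: {canonical_source!r}")
    if len(rows) < min_rows:
        failures.append(f"min_rows: have {len(rows)}, need {min_rows}")

    # Every declared bucket and journey needs at least one row.
    bucket_counts = Counter(row.get("bucket") for row in rows)
    journey_counts = Counter(row.get("journey_id") for row in rows)
    failures += [
        f"per_bucket_min_rows: no rows for bucket {b!r}"
        for b in declared_buckets if bucket_counts[b] < 1
    ]
    failures += [
        f"per_journey_min_rows: no rows for journey {j!r}"
        for j in declared_journeys if journey_counts[j] < 1
    ]

    # expectations_schema_complete: checked for non-regression rows only.
    for row in rows:
        if row.get("split") == "regression":
            continue
        missing = [f for f in CANONICAL_FIELDS if row.get(f) in (None, "")]
        if missing:
            failures.append(f"row {row.get('row_id')!r}: missing or empty {missing}")
    return failures
```

A runner could call this once at startup and abort when the returned list is non-empty.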
Creating and persisting datasets (UC)
Table naming: use a stable Unity Catalog identifier, e.g. {catalog}.{schema}.{app_name}_benchmarks (replace with your catalog, schema, and app slug).
Tabular construction (pandas)
import pandas as pd


def rows_to_eval_df(questions, expectations=None):
    """Build an evaluation DataFrame with one row per question."""
    records = []
    for i, q in enumerate(questions):
        r = {"inputs": {"question": q}}
        # Attach expectations only when one exists at the same index.
        if expectations and i < len(expectations):
            r["expectations"] = expectations[i]
        records.append(r)
    return pd.DataFrame(records)
get_dataset + merge_records (SDK)
Aligns with Create a dataset using the SDK:
import mlflow.genai.datasets

# catalog, schema, and app_name must already be defined for your workspace.
uc_table = f"{catalog}.{schema}.{app_name}_benchmarks"

try:
    # Reuse the dataset when it already exists.
    eval_dataset = mlflow.genai.datasets.get_dataset(name=uc_table)
except Exception:
    # Otherwise create it on first use.
    eval_dataset = mlflow.genai.datasets.create_dataset(name=uc_table)

records = [...]  # list of dicts: inputs / outputs / expectations / trace per rules above
eval_dataset.merge_records(records)
After merges, eval_dataset.to_df() (or your validated in-memory frame) is the source of truth for row counts and deduplication before evaluation.
Loading for evaluation
- Preferred: materialize from the EvaluationDataset or the DataFrame you already validated in memory.
- Spark / Delta: if you read {catalog}.{schema}.{app_name}_benchmarks directly, use REFRESH TABLE (or equivalent) when another task may have altered the table in the same run, then parse JSON columns if inputs/expectations are stored as strings.
Generic read pattern:
def load_eval_rows_spark(spark, full_table_name: str):
    # Refresh cached metadata in case another task rewrote the table in this run.
    spark.sql(f"REFRESH TABLE {full_table_name}")
    return spark.table(full_table_name)
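When inputs or expectations come back as JSON strings, they need decoding before evaluation. The sketch below works on plain dict rows and assumes only those two columns are string-encoded; parse_json_columns is a hypothetical helper name.

```python
import json
from typing import Any

# Assumption: only these two columns may arrive as JSON strings.
JSON_COLUMNS = ("inputs", "expectations")


def parse_json_columns(row: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the row with string-encoded JSON columns decoded."""
    parsed = dict(row)
    for column in JSON_COLUMNS:
        value = parsed.get(column)
        # Columns that are already dicts (or absent) pass through unchanged.
        if isinstance(value, str):
            parsed[column] = json.loads(value)
    return parsed
```

With PySpark, collected rows could be converted with Row.asDict() and passed through this function one at a time; a malformed string raises json.JSONDecodeError, which is a reasonable point to reject the row.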
Best practice: evaluate from deduped in-memory data, not the raw UC table
merge_records upserts by record identity; if the same logical example is merged with different IDs, the Delta table can accumulate stale duplicates. Downstream, reading the table without the same dedupe logic as merge can inflate row counts and distort metrics.
Do: dedupe in memory (for example by stable business key: normalized question, session+turn, or hash of inputs) and pass that DataFrame or list to mlflow.genai.evaluate().
Don't: assume the UC table is duplicate-free or that merge_records removed older variants unless your merge keys guarantee it.
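In-memory deduplication by a hash of inputs might look like the sketch below. Keeping the last occurrence of each key is an assumption (later merges are treated as corrections), and both function names are hypothetical.

```python
import hashlib
import json
from typing import Any


def inputs_key(inputs: dict[str, Any]) -> str:
    """Hash an inputs dict so that key order does not affect the result."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe_rows(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep one row per inputs payload; the last occurrence wins.

    Output order follows the first appearance of each key.
    """
    latest: dict[str, dict[str, Any]] = {}
    for row in rows:
        latest[inputs_key(row["inputs"])] = row
    return list(latest.values())
```

The deduplicated list can then be passed to mlflow.genai.evaluate() in place of the raw table contents. If a normalized question or a session-and-turn pair is a better business key for your data, swap it in for inputs_key.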
Generating benchmarks (generic)
Common sources (see also Data sources for evaluation datasets):
- Production traces: sample via mlflow.search_traces(), filter for quality or failure modes, merge into the dataset after review.
- SME-curated: subject-matter experts label inputs and expectations; best for gold sets and regressions.
- LLM-generated: expand coverage quickly; always validate (schema, permissions, tool contracts) before merge or evaluate.
Validate structure, tool outputs, and permissions before calling mlflow.genai.evaluate() so invalid rows never enter scored runs.
Load references/synthetic-eval-generation.md if you need a full recipe for generating a starter eval set from production traces (cluster traces by intent → LLM-generate paraphrases → expectations harvested from SME-labeled traces → validate → merge).
Load references/benchmark-generation.md if you need end-to-end LLM benchmark generation, Genie Q&A extraction, SQL validation, and issue-focused subsets from failing traces.
Dataset versioning and lineage
- Storage: UC Delta tables behind mlflow.genai.datasets (name like {catalog}.{schema}.{app_name}_benchmarks).
- Runs: log dataset identity and row counts on the evaluation run (see Evaluation runs).
- Deduping: enforce stable keys at merge time and again before evaluate.
- Balance: track counts per category or split when mixing curated and synthetic data.
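The identity and balance figures listed above can be collected into one flat dict before a run starts. This is a sketch assuming rows are dicts carrying split and provenance; dataset_summary and its key naming scheme are hypothetical.

```python
from collections import Counter
from typing import Any


def dataset_summary(dataset_name: str, rows: list[dict[str, Any]]) -> dict[str, Any]:
    """Summarize dataset identity, size, and composition as a flat dict."""
    summary: dict[str, Any] = {"dataset_name": dataset_name, "row_count": len(rows)}
    for field in ("split", "provenance"):
        # Rows that lack the field are counted under "unknown".
        counts = Counter(str(row.get(field, "unknown")) for row in rows)
        for value, count in sorted(counts.items()):
            summary[f"{field}.{value}"] = count
    return summary
```

A flat dict like this is one possible payload for mlflow.log_params() on the evaluation run, so the run records which dataset and which mix of curated and synthetic rows it scored.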
Common mistakes
| Mistake | Consequence | Fix |
|---|---|---|
| inputs keys ≠ predict_fn parameters | TypeError or evaluate failures | Align names; optional args via **kwargs |
| Ground truth only in inputs or only in expectations | Scorers see empty labels | Put scorer-expected keys in expectations |
| Reading Delta without refresh after concurrent writes | Stale or schema errors | REFRESH TABLE + retry |
| Evaluating from raw UC after messy merges | Duplicate / stale rows skew metrics | Dedupe in memory; evaluate that frame |
| Mixing inputs+outputs with trace in one row | Invalid per harness rules | One mode per row |
| Using query in data but predict_fn expects question | Silent mismatch | Same naming in data and app |
DO / DON'T
DO — populate every canonical field; put Correctness ground truth under expected_response:
record = {
    "row_id": "billing_aggregation_001",
    "inputs": {"request": "Total by region?", "expected_sql": sql},
    "expectations": {
        "expected_response": sql,  # canonical ground truth for Correctness
        "expected_signal": "aggregation",  # SECONDARY classification; never mirrored into expected_response
        "bucket": "aggregation",
        "journey_id": "cost_analysis",
        "split": "train",
        "provenance": "curated",
    },
}
DON'T — silently mirror expected_signal into expected_response (Correctness will score against the wrong target):
# WRONG — runner copying expected_signal into expected_response
record["expectations"]["expected_response"] = record["expectations"]["expected_signal"]
DON'T — empty expectations when judges need ground truth:
record = {"inputs": {"request": "...", "expected_sql": sql}, "expectations": {}}
DO — get_dataset with fallback to create_dataset, then merge_records.
DON'T — always create_dataset without checking existence (duplicates / errors).
DO — validate rows (schema, tools, SQL, permissions) before evaluate.
DON'T — trust synthetic outputs without checks.
DO — log dataset name and row_count (and optional validation summaries) on the MLflow run.
DON'T — run evaluate() with no record of which dataset version was used.
Validation checklist
Schema and predictor
- Each row follows harness rules: inputs+outputs or trace, not both
- inputs keys match predict_fn for direct evaluation
- expectations keys match what scorers read
- All canonical fields present: row_id, request, expected_response, expected_signal, bucket, journey_id, split, provenance
- Correctness ground truth lives under expected_response only; expected_signal is NOT mirrored into it
Coverage gates
- min_rows: 40 satisfied
- per_bucket_min_rows: 1 satisfied for every declared bucket
- per_journey_min_rows: 1 satisfied for every ui.user_journeys entry
- expectations_schema_complete: true (all canonical fields present, expected_response non-empty for split != "regression")
- eval_dataset_canonical_source set to one of uc_table | local_json | labeling_session_merge
UC and MLflow
- Dataset name {catalog}.{schema}.{app_name}_benchmarks consistent across writers and loaders
- After concurrent DDL/DML, refresh reads in multi-task jobs
- Evaluation uses deduped data aligned with merge semantics
Content
- Tool / SQL / RAG outputs validated where applicable
- Provenance and split fields populated for traceability
- Regression rows augment, not replace, baseline rows (see references/benchmark-generation.md §11)
Scripts
| Script | Description |
|---|---|
| scripts/create_eval_dataset.py | Optional: load questions from JSON, validate, dedupe, persist to UC / MLflow dataset. |
References
Official documentation
- Building MLflow evaluation datasets
- Evaluate GenAI during development (eval harness)
- Evaluation runs
- Evaluation dataset reference (UI + schema)
- MLflow genai.evaluate
Related skills
- SDLC Step 3 (03-scorers-and-judges): scorers, judges, predict_fn contract
- SDLC Step 1: prompts used during synthetic or judged workflows
- SDLC Step 4: wiring predict_fn and evaluation execution
Local reference files
| Reference | Content |
|---|---|
| references/evaluation-dataset-patterns.md | Record shapes, splits, dedup, builder patterns |
| references/dataset-lineage.md | mlflow.data, log_input, GenAI datasets, Delta lineage |
| references/benchmark-generation.md | Trace mining, LLM generation, validation, provenance |
Version history
| Version | Date | Changes |
|---|---|---|
| 4.1.0 | 2026-04-26 | Added canonical evaluation-dataset fields (row_id, request, expected_response, expected_signal, bucket, journey_id, split, provenance) and coverage gates (min_rows: 40, per-bucket/per-journey minima, expectations_schema_complete, eval_dataset_canonical_source). Pinned Correctness ground truth to expected_response; banned silent mirroring of expected_signal. |
| 4.0.0 | 2026-04-10 | De-coupled from repo-specific patterns. Grounded in official Databricks eval-harness and build-eval-dataset docs. Generic record schemas for different agent types. |
| 3.1.0 | 2026-03-27 | Added reference files, DO/DON'T examples, version history, scripts section |
| 3.0.0 | 2026-03-25 | Initial structured skill with dataset schema, creation, loading, lineage, generation, versioning |