04-evaluation-runs

name: 04-evaluation-runs description: > Use when running mlflow.genai.evaluate() to test agent quality before deployment. Covers the predict_fn contract, answer-sheet mode for re-scoring existing outputs, threshold gates, retry wrappers, human feedback sessions, and conversation evaluation — even if you just want "run my benchmarks and tell me pass or fail." Also use when collecting human labels to calibrate automated scorers. SDLC Step 4. license: Apache-2.0 compatibility: "Requires Databricks workspace with MLflow 3.10+ and Unity Catalog. Scripts use uv." clients: [ide_cli, genie_code] bundle_resource: none deploy_verb: none deploy_note: "mlflow.genai.evaluate() runs via the MLflow SDK on serverless workspace compute; no bundle resource. Identical on both clients; on Genie Code execute on serverless and run any CLI step through runDatabricksCli. See skills/genie-code-environment." coverage: full metadata: last_verified: "2026-06-05" volatility: high upstream_sources: [] author: "prashanth-subrahmanyam" version: "4.1.0" domain: "genai-agents" pipeline_position: "S4" consumes: "scorer_list, threshold_config, evaluation_dataset" produces: "evaluation_results, thresholds_met, mlflow_run_id, failure_shape_classification, failing_trace_ids, safety_buffer, predict_fn_exception_count, predict_fn_sentinel_count_per_run, judges_with_silent_aggregation_dropouts, mlflow_eval_predict_fn_signature, mlflow_eval_known_quality_issues, evaluation_runs_preflight, human_label_count, synthesized_label_count" grounded_in: "https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/concepts/eval-harness, https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/concepts/evaluation-runs, https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/evaluate-conversations, https://docs.databricks.com/aws/en/mlflow3/genai/human-feedback/, https://docs.databricks.com/aws/en/mlflow3/genai/human-feedback/expert-feedback/label-existing-traces" upstream_sources: - name: "ai-dev-kit" 
repo: "databricks-solutions/ai-dev-kit" paths: - "databricks-skills/databricks-mlflow-evaluation/SKILL.md" relationship: "extended" last_synced: "2026-04-27" sync_commit: "281d9acd92d936bd5294f78bd7ec68fb12d4a696" fields_read:

agent.reviewer_role
agent.must_do
agent.must_not_do
docs.agent_tool_plan.selected_tools inputs:
name: agent_tool_plan_ref required: false description: > Path to docs/agent_tool_plan.yaml. When set, failure-shape classification scopes primary_shape: tool_call_empty (and the tool_call_empty routing branch) to tools present in selected_tools[] only. Tools that were never wired cannot trigger this branch. The retrieval routing branch only exists when KA or Vector Search appears in selected_tools[].

Evaluation runs (MLflow GenAI)

Canonical reference for evaluation execution, threshold gating, human feedback, and repeatability when using mlflow.genai.evaluate() on Databricks. Grounded in the official MLflow 3 GenAI eval harness, evaluation runs, conversation evaluation, and human feedback documentation.

Upstream Lineage

This skill extends AI-Dev-Kit's databricks-mlflow-evaluation skill for mlflow.genai.evaluate() execution, regression detection, production trace re-scoring, human labeling loops, and evaluation result analysis. If the eval harness contract or result object behavior is ambiguous, consult the upstream skill first, then preserve this skill's SDLC telemetry and gate-capture requirements.

When to Use

Measure agent quality on a fixed benchmark with mlflow.genai.evaluate().
Wire predict_fn, scorers, and dataset—or use answer sheet mode with pre-computed outputs.
Gate promote/deploy decisions on judge scores vs thresholds.
Add retries for transient harness or infrastructure failures.
Re-score existing outputs with new scorers or evaluate pre-collected production traces.
Run labeling sessions and sync human annotations into metrics.

Core Evaluation Flow

import mlflow

results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=agent_predict_fn,
    scorers=scorer_list,
)

The eval harness runs your predictor (if provided), attaches traces, applies scorers, and returns structured results. See Evaluation runs and Eval harness.

Trace destination (Unity Catalog)

Configure UC trace storage once before evaluate() so traces persist for labeling and dashboards. See Skill 07 (Production Monitoring) → Trace Destination for the full set_experiment(trace_location=UnityCatalog(...)) pattern and UC permissions.

predict_fn Contract

MLflow passes one row’s inputs into predict_fn (as a dict). The provided evaluation harness (run_evaluation.py) accepts both str and dict returns — it normalizes dicts to strings automatically via the _out() wrapper. Built-in scorers (Safety, RelevanceToQuery) work with traces, so a simple -> str return is sufficient.

predict_fn is optional when data already includes an outputs column (see Answer sheet mode below).

Using the provided harness (recommended)

Your track's predict_fn(inputs: dict) -> str works as-is with run_evaluation.py:

# --experiment-path MUST be the user-and-use-case-pinned eval experiment, e.g.
#   /Users/<user_email>/mlflow/<APP_NAME>-eval
# Read it from .vibecoding-state.md (mlflow_experiment_path with -eval leaf swap)
# instead of using a literal /Shared/my-agent/traces.
uv run run_evaluation.py --predict-module predict_fn.py \
  --experiment-path /Users/<user_email>/mlflow/<APP_NAME>-eval \
  --dataset-table catalog.schema.benchmarks \
  --thresholds '{"safety/mean": 0.7, "relevance_to_query/mean": 0.7}'

Returning dicts for richer scorer input (advanced)

If calling mlflow.genai.evaluate() directly (without the harness), return a dict to pass richer data to custom scorers:

def make_eval_predict_fn(track_fn):
    """Adapts a track callable for direct mlflow.genai.evaluate() use with dict return."""

    def predict_fn(inputs: dict) -> dict:
        question = inputs["question"]
        response = track_fn(question)
        return {"response": response}

    return predict_fn

Three common dict shapes (for direct evaluate() use)

Simple Q&A — (inputs) -> {"response": str}

def predict_fn(inputs: dict) -> dict:
    q = inputs["question"]
    return {"response": my_agent.answer(q)}

RAG — include retrieval for context-aware scorers:

def predict_fn(inputs: dict) -> dict:
    chunks = retriever.search(inputs["question"])
    answer = my_agent.answer(inputs["question"], chunks)
    return {"response": answer, "retrieved_context": chunks}

Conversation — inputs includes message history; same return shape:

def predict_fn(inputs: dict) -> dict:
    messages = inputs["messages"]  # multi-turn history
    return {"response": my_agent.chat(messages)}

For multi-turn evaluation patterns, see Evaluate conversations.

`Correctness` consumes `expected_response`

The built-in Correctness scorer reads expectations["expected_response"] from the dataset row — not expected_signal, expected_answer, or any other field name. If your benchmark stores the gold answer under a different key, either rename the column to expected_response or pass targets="expectations/<your_field>" explicitly when constructing the scorer:

from mlflow.genai.scorers import Correctness

# Default: reads expectations["expected_response"]
correctness = Correctness()

# Explicit target; replace expected_response with your dataset's field name
correctness = Correctness(targets="expectations/expected_response")

Mismatched field names produce silent None rows (the scorer skips, no error) and break threshold gates. See Skill 02 (Evaluation Datasets) for the canonical expected_response field on dataset rows and Skill 03 (Scorers and Judges) for the matching scorer contract.
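Because a mismatched field name fails silently, a pre-flight check on the dataset rows can catch it before evaluate() runs. The sketch below is not part of MLflow: find_rows_missing_expected_response is a hypothetical helper, and it assumes the dataset is available as a list of dicts, each with an optional expectations mapping.

```python
def find_rows_missing_expected_response(rows, field="expected_response"):
    """Return indices of rows whose expectations lack a non-empty `field`."""
    missing = []
    for index, row in enumerate(rows):
        expectations = row.get("expectations") or {}
        value = expectations.get(field)
        # Treat absent, None, and blank-string values as missing.
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(index)
    return missing


# Illustrative rows: only the first uses the field name Correctness reads.
rows = [
    {"inputs": {"question": "q1"}, "expectations": {"expected_response": "a1"}},
    {"inputs": {"question": "q2"}, "expectations": {"expected_answer": "a2"}},
    {"inputs": {"question": "q3"}},
]
missing = find_rows_missing_expected_response(rows)
```

Failing the run when the returned list is non-empty turns a silent None row into an explicit error.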

Answer sheet evaluation mode

When data includes both outputs and expectations (and optionally other columns the scorers need), you can call evaluate() without predict_fn. The harness scores existing outputs—useful to re-score with new scorers or evaluate pre-collected production traces.

import mlflow

eval_df = ...  # columns: inputs, outputs, expectations (and any scorer inputs)

eval_result = mlflow.genai.evaluate(
    data=eval_df,
    scorers=scorer_list,
    # predict_fn omitted — outputs column supplies model outputs
)

This matches the harness behavior described in Eval harness: pre-computed predictions are scored directly.
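One way to decide between answer sheet mode and a live predictor is to check whether every row already has outputs. This is a minimal sketch under assumptions: rows are plain dicts, and can_use_answer_sheet and build_evaluate_kwargs are hypothetical helper names, not MLflow APIs.

```python
def can_use_answer_sheet(rows):
    """True only when every row already carries a non-None outputs value."""
    return bool(rows) and all(row.get("outputs") is not None for row in rows)


def build_evaluate_kwargs(rows, scorers, predict_fn=None):
    """Assemble keyword arguments for evaluate(), omitting predict_fn when possible."""
    kwargs = {"data": rows, "scorers": scorers}
    if can_use_answer_sheet(rows):
        return kwargs  # answer sheet mode: existing outputs are scored as-is
    if predict_fn is None:
        raise ValueError("Some rows lack outputs; supply predict_fn to generate them.")
    kwargs["predict_fn"] = predict_fn
    return kwargs


# Illustrative rows with pre-computed outputs.
answer_sheet_rows = [
    {"inputs": {"question": "q1"}, "outputs": "a1", "expectations": {"expected_response": "a1"}},
    {"inputs": {"question": "q2"}, "outputs": "a2", "expectations": {"expected_response": "b2"}},
]
kwargs = build_evaluate_kwargs(answer_sheet_rows, scorers=[])
```

The resulting kwargs can be passed as mlflow.genai.evaluate(**kwargs). Note that this sketch drops predict_fn whenever outputs are complete, so pass rows without outputs if you want fresh predictions.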

Retry wrapper

Wrap mlflow.genai.evaluate() with retry and exponential backoff for transient failures (timeouts, rate limits, intermittent worker errors).

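import time

import mlflow

# Substrings that mark an error as transient; tune for your environment.
TRANSIENT_MARKERS = ("timeout", "temporarily unavailable", "rate limit", "503", "504")


def is_retryable_error(exc: BaseException) -> bool:
    """Return True when the error message contains a known transient marker."""
    msg = str(exc).lower()
    return any(m in msg for m in TRANSIENT_MARKERS)


def evaluate_with_retry(data, scorers, predict_fn=None, max_retries=4, base_sleep_s=10):
    """Call mlflow.genai.evaluate(), retrying transient failures with exponential backoff."""
    kwargs = {"data": data, "scorers": scorers}
    if predict_fn is not None:
        kwargs["predict_fn"] = predict_fn  # omitted in answer sheet mode
    for attempt in range(max_retries):
        try:
            return mlflow.genai.evaluate(**kwargs)
        except Exception as e:
            # Re-raise on the last attempt or on any non-transient error.
            if attempt >= max_retries - 1 or not is_retryable_error(e):
                raise
            # Exponential backoff: base, 2x base, 4x base, ...
            time.sleep(base_sleep_s * (2 ** attempt))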

Tune TRANSIENT_MARKERS, worker env vars, and max workers per your environment. Optionally fall back to sequential row-by-row evaluation if batch mode keeps failing.
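The sequential fallback mentioned above could look like the sketch below. It assumes the dataset is a list of row dicts and takes the evaluation call as an injected function (evaluate_one, a hypothetical parameter), so the example runs without a workspace. In practice evaluate_one would wrap mlflow.genai.evaluate() for a single-row batch.

```python
def evaluate_rows_sequentially(rows, evaluate_one):
    """Evaluate one row at a time, collecting failures instead of aborting the batch."""
    results, failed = [], []
    for index, row in enumerate(rows):
        try:
            results.append(evaluate_one([row]))
        except Exception as exc:
            # Keep the row index so the failure can be retried or reported later.
            failed.append((index, repr(exc)))
    return results, failed


def fake_evaluate(batch):
    """Stand-in for a real evaluate call; fails on one specific input."""
    if batch[0]["inputs"]["question"] == "bad":
        raise TimeoutError("timeout")
    return {"rows": len(batch)}


rows = [
    {"inputs": {"question": "ok-1"}},
    {"inputs": {"question": "bad"}},
    {"inputs": {"question": "ok-2"}},
]
results, failed = evaluate_rows_sequentially(rows, fake_evaluate)
```

Row-by-row evaluation trades throughput for isolation, so it is best kept as a last resort after batch retries are exhausted.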

Handling None traces

If predict_fn raises for a row, the harness may record None for that row’s trace. Do not assume every index has a valid trace.

for i, tr in enumerate(eval_result.traces or []):
    if tr is None:
        # log row id, skip tagging, or collect for retry
        continue
    # use tr.trace_id, etc.
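The same loop can be packaged so that the valid trace IDs and the failed row indices are both kept. This is a sketch: partition_traces is a hypothetical helper, and SimpleNamespace objects stand in for real trace objects that expose trace_id.

```python
from types import SimpleNamespace


def partition_traces(traces):
    """Split per-row traces into valid trace IDs and indices of rows with no trace."""
    valid_ids, failed_rows = [], []
    for index, trace in enumerate(traces or []):
        if trace is None:
            failed_rows.append(index)
        else:
            valid_ids.append(trace.trace_id)
    return valid_ids, failed_rows


# Stand-in traces; the middle row represents a predict_fn failure.
traces = [SimpleNamespace(trace_id="tr-1"), None, SimpleNamespace(trace_id="tr-3")]
valid_ids, failed_rows = partition_traces(traces)
```

The failed row indices are a natural input for a focused retry, and their count is one possible source for an exception tally.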

Extracting results

After evaluate(), use the returned object’s fields (names align with Evaluation runs):

eval_result = mlflow.genai.evaluate(...)

metrics = eval_result.metrics  # aggregate scorer metrics
traces = eval_result.traces  # per-row traces; entries may be None
table = eval_result.eval_table  # tabular join of inputs, outputs, scores

Prefer eval_table for threshold checks on per-row or aggregated columns.

Score normalization

Two scales often appear in the same pipeline:

Role	Typical scale	Example
Harness / MLflow metric columns	0–1 means (`metric/mean`)	`relevance/mean`
Product thresholds in “points”	0–100 per judge	dashboard gates

Normalize before comparing: multiply 0–1 by 100 when your gates expect 0–100, or divide thresholds by 100 when comparing to 0–1 columns. Keep one mental model per gate so you never compare raw 0–1 scores to 0–100 thresholds without conversion.
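A small conversion helper makes the scale explicit at every comparison. The sketch below is illustrative only (to_unit_scale and meets_threshold are hypothetical names; Skill 03 holds the canonical normalization helpers) and assumes the only scales in play are 0 to 1 and 0 to 100.

```python
def to_unit_scale(value, scale_max):
    """Convert a score on a 0-1 or 0-100 scale to the 0-1 scale."""
    if scale_max not in (1, 100):
        raise ValueError("scale_max must be 1 or 100")
    return value / scale_max


def meets_threshold(metric_value, threshold, metric_scale=1, threshold_scale=100):
    """Compare a metric and a threshold only after both are on the 0-1 scale."""
    return to_unit_scale(metric_value, metric_scale) >= to_unit_scale(threshold, threshold_scale)


# A 0-1 MLflow mean compared against a 0-100 product gate.
passes = meets_threshold(0.82, 70)
fails = meets_threshold(0.65, 70)
```

Requiring the scale as an argument means a raw 0 to 1 score can never be compared to a 0 to 100 threshold by accident.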

Threshold gate checks

See Skill 03 (Scorers and Judges) → Threshold Checking for the all_thresholds_met() pattern and normalization helpers. Log pass/fail and persist thresholds_met on the MLflow run for auditability.
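For orientation, a gate check can be as small as the sketch below. It is not the all_thresholds_met() implementation from Skill 03: check_thresholds is a hypothetical name, both inputs are assumed to be on the same 0 to 1 scale, and a metric missing from the results is treated as a failure so that a dropped scorer cannot pass silently.

```python
def check_thresholds(metrics, thresholds):
    """Return (thresholds_met, failures), where failures maps metric name to observed value."""
    failures = {}
    for name, minimum in thresholds.items():
        observed = metrics.get(name)
        # A missing metric fails the gate rather than being skipped.
        if observed is None or observed < minimum:
            failures[name] = observed
    return (not failures), failures


# Illustrative values; correctness/mean is absent from the metrics on purpose.
metrics = {"safety/mean": 0.91, "relevance_to_query/mean": 0.64}
thresholds = {"safety/mean": 0.7, "relevance_to_query/mean": 0.7, "correctness/mean": 0.8}
thresholds_met, failures = check_thresholds(metrics, thresholds)
```

The boolean is the value to persist as thresholds_met on the run, and the failures mapping is useful context for the pass/fail log line.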

Failure shape router (normative)

When a scored evaluation fails its gate, the next iteration step depends on what kind of failure it was — not just which scorer regressed. Emit a failure_shape_classification payload alongside threshold results, and route iteration based on primary_shape:

failure_shape_classification:
  primary_shape: enum  # one of: instruction | tool_call_empty | retrieval | scorer_calibration | safety_classifier
  failing_scorers_if_regressed: [string]
  l1_failures: [string]              # L1 = architecture-level (system prompt, role binding, refusal)
  failing_trace_ids:
    - trace_id: string
      failing_scorers: [string]
      predict_fn_status: string      # ok | exception | sentinel

Routing rules

`primary_shape`	Route to	Pre-condition
`instruction`	Skill 08b (prompt hand-authoring)	Only if `l1_failures` is empty. If L1 failures exist, route to architecture / system-prompt redesign instead — do not paper over an L1 failure with prompt iteration.
`tool_call_empty`	Skill 06 direct trace debug	Symptoms: `UNRESOLVED_COLUMN.WITH_SUGGESTION`, `TABLE_OR_VIEW_NOT_FOUND`, permission-denied, or empty tool output. Fix the data/grant/SQL-grounding issue before re-running eval.
`retrieval`	Retrieval tuning (chunking, reranker, top-k, embeddings)	Failing scorers are retrieval-shaped (`groundedness`, `retrieval_relevance`, `context_precision`). Do not iterate the system prompt.
`scorer_calibration`	Skill 03 (Scorers and Judges)	Judge disagreement with human labels exceeds your configured tolerance (X%). Fix the scorer prompt, aggregation, or `feedback_value_type` before treating the eval signal as ground truth.
`safety_classifier`	Endpoint audit and role re-binding	Safety scorer regressed because the scoring endpoint is the wrong model or hit a guardrail. Audit `llm_role_endpoints.llm_judge_safety` binding before iterating the agent.

Hard rule: never route an L1 failure to Skill 08b. L1 means architecture/role-binding/refusal — instruction iteration cannot fix it. Mis-routing here is the single most expensive failure mode in the SDLC.
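The routing table and the hard rule can be expressed as a small lookup. This is a sketch of the rules as written above, not an existing module: route_failure, ROUTES, and L1_ROUTE are hypothetical names, and the classification argument is assumed to be a dict shaped like the failure_shape_classification payload.

```python
ROUTES = {
    "instruction": "Skill 08b (prompt hand-authoring)",
    "tool_call_empty": "Skill 06 direct trace debug",
    "retrieval": "Retrieval tuning (chunking, reranker, top-k, embeddings)",
    "scorer_calibration": "Skill 03 (Scorers and Judges)",
    "safety_classifier": "Endpoint audit and role re-binding",
}
L1_ROUTE = "Architecture / system-prompt redesign"


def route_failure(classification):
    """Map a failure_shape_classification payload to its next iteration step."""
    shape = classification["primary_shape"]
    if shape not in ROUTES:
        raise ValueError(f"Unknown primary_shape: {shape!r}")
    # Hard rule: an instruction-shaped failure with L1 failures never goes to Skill 08b.
    if shape == "instruction" and classification.get("l1_failures"):
        return L1_ROUTE
    return ROUTES[shape]


clean = route_failure({"primary_shape": "instruction", "l1_failures": []})
blocked = route_failure({"primary_shape": "instruction", "l1_failures": ["refusal"]})
```

Raising on an unknown shape keeps a typo in primary_shape from falling through to a default route.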

Eval telemetry contract (normative)

Every scored evaluation run must capture and persist the following fields on the MLflow run (as tags, params, or artifact JSON — pick one and stay consistent):

Field	Type	Meaning
`failing_trace_ids`	`[{trace_id, failing_scorers, predict_fn_status}]`	Per-row failure detail. Required for routing and for re-running iteration on a focused subset.
`safety_buffer`	`{<scorer_name>: float}`	Margin between observed metric and gate threshold (positive = passing with headroom; negative = failing). Lets the next iteration step know how close to the cliff each scorer is.
`predict_fn_exception_count`	`int`	Total rows where `predict_fn` raised. Non-zero values mean some scorer means are computed over a smaller denominator than the dataset row count.
`predict_fn_sentinel_count_per_run`	`int`	Rows that returned a sentinel string (e.g. `"INPUT_GUARDRAIL_BLOCKED"`, `"LAKEBASE_COLD_START_FAILED"`). Not the same as exceptions — sentinels successfully return but represent productized debt. See `debt: predict_fn_input_guardrail_sentinel`.
`judges_with_silent_aggregation_dropouts`	`[string]`	Judges where `<scorer>/mean` is missing from `eval_result.metrics` because aggregation defaulted to `[]`. Must be empty before promoting. See Skill 03 → make_judge aggregation contract.
`mlflow_eval_predict_fn_signature`	`string`	The exact signature the harness saw, e.g. `(inputs: dict) -> str` or `(inputs: dict) -> dict`. Captured because mismatched signatures are the most common cause of empty / `None` traces.
`mlflow_eval_known_quality_issues`	`[{issue_id, owner_prompt_role, target_prompt_role, status}]`	Open quality issues against this evaluation run. If any item has `target_prompt_role: first_scored_eval` and `status != closed`, the gate must fail closed until the item is resolved or explicitly waived.

Capture in a single payload per run so dashboards and the iteration router can read it as one unit:

import json
import mlflow

eval_telemetry = {
    "failing_trace_ids": [...],
    "safety_buffer": {"safety/mean": 0.04, "correctness/mean": -0.12},
    "predict_fn_exception_count": 0,
    "predict_fn_sentinel_count_per_run": 2,
    "judges_with_silent_aggregation_dropouts": [],
    "mlflow_eval_predict_fn_signature": "(inputs: dict) -> str",
    "mlflow_eval_known_quality_issues": [],
}

mlflow.log_dict(eval_telemetry, "eval_telemetry.json")
mlflow.set_tag("predict_fn_exception_count", str(eval_telemetry["predict_fn_exception_count"]))
mlflow.set_tag("predict_fn_sentinel_count_per_run", str(eval_telemetry["predict_fn_sentinel_count_per_run"]))

Downstream iteration prompts (first_scored_eval, instruction_iteration) consume this payload directly — do not re-derive it.
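Several telemetry fields can be derived mechanically from the metrics and thresholds. The sketch below shows one way to do that under assumptions: metrics is a plain dict keyed by <scorer>/mean, the helper names are hypothetical, and margins are rounded to four decimals purely for readability.

```python
def build_safety_buffer(metrics, thresholds):
    """Margin between observed metric and threshold (positive means headroom)."""
    return {
        name: round(metrics[name] - minimum, 4)
        for name, minimum in thresholds.items()
        if name in metrics
    }


def find_aggregation_dropouts(metrics, scorer_names):
    """Scorers whose <scorer>/mean is absent from the aggregate metrics."""
    return [name for name in scorer_names if f"{name}/mean" not in metrics]


def has_blocking_quality_issue(issues):
    """True when an open issue targets first_scored_eval, so the gate fails closed."""
    return any(
        issue.get("target_prompt_role") == "first_scored_eval"
        and issue.get("status") != "closed"
        for issue in issues
    )


# Illustrative values only.
metrics = {"safety/mean": 0.74, "correctness/mean": 0.68}
thresholds = {"safety/mean": 0.7, "correctness/mean": 0.8}
safety_buffer = build_safety_buffer(metrics, thresholds)
dropouts = find_aggregation_dropouts(metrics, ["safety", "correctness", "groundedness"])
blocked = has_blocking_quality_issue(
    [{"issue_id": "q-1", "target_prompt_role": "first_scored_eval", "status": "open"}]
)
```

Deriving these values in one place keeps the payload consistent with the gate result logged for the same run.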

Vibecoding state: `evaluation_runs_preflight` (producer)

This skill is the producer of the evaluation_runs_preflight.predict_fn_signature_matches_runner boolean read by preflight_check_registry.predict_fn_signature_matches_runner (see skills/vibecoding-state/references/spec-schema.md § Evaluation Runs Preflight). After every scored run, in addition to writing the eval_telemetry.json payload above, write the namespaced state block to the live state file:

# state-file shape (set by this skill after every scored run)
evaluation_runs_preflight:
  predict_fn_signature_matches_runner: <bool>     # see derivation below
  last_run_at: "<ISO8601 UTC>"

Derivation. predict_fn_signature_matches_runner == true iff the captured mlflow_eval_predict_fn_signature matches the runner-expected (inputs: dict) -> str or (inputs: dict) -> dict shape. A run whose captured signature does not match flips the boolean back to false so the registry check halts the next prompt in blocks_prompt_roles[] (local_eval_smoke, first_scored_eval) until the operator fixes the signature and re-runs. last_run_at always records the most recent write.
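The derivation can be sketched with the standard library's inspect module. The helper names and the RUNNER_EXPECTED_SIGNATURES constant are hypothetical, and the comparison is a plain string match, which assumes annotations are not postponed (with from __future__ import annotations the rendered string gains quotes and would need normalizing).

```python
import datetime
import inspect

RUNNER_EXPECTED_SIGNATURES = {"(inputs: dict) -> str", "(inputs: dict) -> dict"}


def capture_signature(fn):
    """Render a callable's signature as a string, e.g. '(inputs: dict) -> str'."""
    return str(inspect.signature(fn))


def build_preflight_state(fn):
    """Build the evaluation_runs_preflight block for the state file."""
    return {
        "predict_fn_signature_matches_runner": capture_signature(fn) in RUNNER_EXPECTED_SIGNATURES,
        "last_run_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }


def good_predict_fn(inputs: dict) -> str:
    return "ok"


def bad_predict_fn(question):
    return "ok"


captured = capture_signature(good_predict_fn)
good_state = build_preflight_state(good_predict_fn)
bad_state = build_preflight_state(bad_predict_fn)
```

The captured string is also the value to record as mlflow_eval_predict_fn_signature, so both fields come from the same observation.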

This replaces the ad-hoc top-level predict_fn_signature_matches_runner_status enum seeded earlier in Phase 1.7; that name was non-namespaced and stringly typed.

Repeatability

Run the same benchmark multiple times; compare variance in outputs or scores.
Store references from a baseline run (e.g. prior SQL or result hashes) in expectations for dedicated repeatability scorers.
Set explicit targets for stability (e.g. minimum agreement rate) in your config, not hard-coded in library code.
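The points above can be sketched as two small measurements: the spread of a metric across repeated runs, and the agreement rate between a baseline run's outputs and a rerun. The helper names and the min_agreement_rate config key are hypothetical, and exact string equality is only one possible definition of agreement.

```python
from statistics import mean, pstdev


def score_spread(run_scores):
    """Summarize how much one metric moves across repeated runs."""
    return {
        "mean": mean(run_scores),
        "stdev": pstdev(run_scores),
        "range": max(run_scores) - min(run_scores),
    }


def agreement_rate(baseline_outputs, rerun_outputs):
    """Fraction of rows whose output is identical in both runs."""
    if len(baseline_outputs) != len(rerun_outputs):
        raise ValueError("Runs must cover the same rows")
    if not baseline_outputs:
        raise ValueError("No rows to compare")
    matches = sum(a == b for a, b in zip(baseline_outputs, rerun_outputs))
    return matches / len(baseline_outputs)


# The stability target lives in config, not in library code.
config = {"min_agreement_rate": 0.9}

spread = score_spread([0.80, 0.84, 0.82])
rate = agreement_rate(["a", "b", "c", "d"], ["a", "b", "x", "d"])
is_stable = rate >= config["min_agreement_rate"]
```

For generated SQL or free text, comparing result hashes or normalized strings is usually more forgiving than raw equality.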

Human feedback

Scope of this section: expert / SME labeling sessions for ground-truth calibration. For end-user thumbs-up/down or rating feedback collected from a deployed app and written back via mlflow.log_feedback(...), see 04c-end-user-feedback. The two are complementary: end-user signal in 04c surfaces what to label; expert sessions here resolve what is correct.

Human labels are ground truth for calibrating automated scorers and settling disagreements. Databricks MLflow GenAI supports label schemas, labeling sessions, and the MLflow Review App—a built-in UI where reviewers open traces, apply labels, and resolve disagreements. You can also use a customizable Review App template when you need branding or workflow-specific layouts; both paths feed the same labeling session APIs below.

See Human feedback and Label existing traces.

Label schemas

Define what reviewers capture. InputCategorical enforces a closed set (fast analytics, high agreement). InputText is free-form (rationales, failure modes). InputTextList collects multiple strings (e.g. multiple issues per trace).

from mlflow.genai.label_schemas import InputCategorical, InputText, InputTextList

relevance_schema = InputCategorical(
    name="relevance",
    title="Is the response relevant?",
    options=["relevant", "partially_relevant", "irrelevant"],
)
notes_schema = InputText(name="notes", title="Reviewer notes")
issues_schema = InputTextList(
    name="failure_modes",
    title="What went wrong? (add one per line)",
)

Keep names stable across sprints so you can trend metrics over time.

Labeling session

Create a session per review wave (sprint, model version, or incident). Tie it to the registered model (Unity Catalog three-level name) that reviewers associate with the agent.

import pandas as pd
from mlflow.genai.labeling import create_labeling_session

session = create_labeling_session(
    name="sprint-42-review",
    model_name=f"{catalog}.{schema}.my_agent",
    label_schemas=[relevance_schema, notes_schema, issues_schema],
)

`LabelingSession.add_traces` contract

session.add_traces(...) requires a pandas DataFrame whose rows include the trace IDs to label. Passing a bare list of strings (session.add_traces(["tr-1", "tr-2"])) does not work — the call appears to succeed but the session is empty in the Review App.

# CORRECT — DataFrame with a trace_id column
trace_df = pd.DataFrame({"trace_id": [tr.trace_id for tr in eval_result.traces if tr is not None]})
session.add_traces(trace_df)

# WRONG: a bare list of trace ID strings silently no-ops (or errors, depending on version)
# session.add_traces(["tr-1", "tr-2"])

Trace IDs typically come from eval_result.traces or from production logging. Ensure the Unity Catalog trace destination configured before evaluate() (see Trace destination above) matches where those traces were written.

Sync annotations

After reviewers finish in the Review App (or your custom template backed by the same session):

session.sync(to_dataset=f"{catalog}.{schema}.benchmarks")

`session.sync` does not merge `log_feedback` labels

Critical contract gap: session.sync(to_dataset=...) only writes the label-schema responses captured inside the Review App back to the dataset's expectations. It does not merge labels written separately via mlflow.log_feedback(trace_id=..., name=..., value=...) — those remain attached to the trace as feedback assessments and are silently dropped by sync().

If reviewers use the in-app thumbs/categorical schema and you also call mlflow.log_feedback(...) from your own tooling (custom Review App templates, end-user feedback, batch labeling scripts), you must run a second pass to merge the feedback labels into the dataset:

def merge_records_from_session(session_id: str, dataset: str) -> int:
    """Merge `log_feedback` assessments from a labeling session into dataset expectations.

    `session.sync()` only propagates label-schema responses. Feedback assessments
    written via `mlflow.log_feedback(...)` are not merged automatically — this helper
    closes that gap by reading feedback off each trace in the session and writing
    the values back into the dataset row's `expectations`.

    Returns the number of rows updated.
    """
    import mlflow
    from mlflow.genai.datasets import get_dataset

    session = mlflow.genai.labeling.get_labeling_session(session_id)
    ds = get_dataset(dataset)
    updated = 0
    for trace_id in session.trace_ids():
        trace = mlflow.get_trace(trace_id)
        feedback = {a.name: a.value for a in (trace.info.assessments or []) if a.source.source_type == "HUMAN"}
        if not feedback:
            continue
        ds.merge_records([{"trace_id": trace_id, "expectations": feedback}])
        updated += 1
    return updated

merged = merge_records_from_session(session.session_id, f"{catalog}.{schema}.benchmarks")

Always run session.sync(to_dataset=...) first, then merge_records_from_session(...) second. The two are complementary, not redundant.

Labeling-session quality gates

Gate the iteration on label coverage and quality, not just label count. Capture and threshold:

Field	Type	Gate
`human_label_count`	`int`	At least one human label per scorer you intend to calibrate.
`synthesized_label_count`	`int`	LLM-as-judge synthesized labels (e.g. from a hand-authored gold-set bootstrapper). Must be tagged distinctly so they don't masquerade as human truth.
`comment_text`	`string` (per row)	Reviewer rationale. Required on every label that disagrees with the automated judge — disagreement without rationale is unactionable.
Time-to-label distribution	`{p50_seconds, p95_seconds, n}`	If p95 is very low (seconds per trace) reviewers are skimming; if p95 is very high (>20 min) the schema is too complex. Both are signals to redesign before trusting the labels.

# Persist alongside eval telemetry
labeling_telemetry = {
    "human_label_count": 87,
    "synthesized_label_count": 0,
    "labels_missing_comment_on_disagreement": 4,  # MUST be 0 before promote
    "time_to_label_p50_seconds": 32,
    "time_to_label_p95_seconds": 180,
    "time_to_label_n": 87,
}
mlflow.log_dict(labeling_telemetry, "labeling_telemetry.json")
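The time-to-label distribution can be computed from per-label durations with a nearest-rank percentile. This is a sketch under assumptions: durations are available in seconds, the helper names are hypothetical, the 20-minute ceiling comes from the table above, and the skim floor of 10 seconds is an illustrative default rather than a documented value.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_to_label_summary(durations_seconds):
    """Build the {p50_seconds, p95_seconds, n} distribution from raw durations."""
    return {
        "p50_seconds": percentile(durations_seconds, 50),
        "p95_seconds": percentile(durations_seconds, 95),
        "n": len(durations_seconds),
    }


def label_timing_flags(summary, skim_floor_seconds=10, complex_ceiling_seconds=20 * 60):
    """Flag distributions that suggest skimming or an overly complex schema."""
    flags = []
    if summary["p95_seconds"] < skim_floor_seconds:
        flags.append("reviewers_may_be_skimming")
    if summary["p95_seconds"] > complex_ceiling_seconds:
        flags.append("schema_may_be_too_complex")
    return flags


# Illustrative durations, one per label.
durations = [20, 25, 30, 32, 35, 40, 60, 90, 150, 180]
summary = time_to_label_summary(durations)
flags = label_timing_flags(summary)
```

An empty flags list means the timing looks plausible; it says nothing about label correctness, which still depends on the comment and agreement gates.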

Use synced labels to compute agreement with automated judges, find systematic false positives/negatives, and adjust scorer prompts or thresholds.
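Agreement with an automated judge can be computed once human and judge labels are keyed by trace ID. The sketch below assumes both are plain dicts of categorical labels and that one label value counts as the positive class; judge_agreement is a hypothetical helper, and human labels are treated as ground truth.

```python
def judge_agreement(human_labels, judge_labels, positive="relevant"):
    """Compare judge labels against human labels on the traces both cover."""
    shared = sorted(set(human_labels) & set(judge_labels))
    agree = false_positives = false_negatives = 0
    for trace_id in shared:
        human_positive = human_labels[trace_id] == positive
        judge_positive = judge_labels[trace_id] == positive
        if human_positive == judge_positive:
            agree += 1
        elif judge_positive:
            false_positives += 1  # judge passed what the human rejected
        else:
            false_negatives += 1  # judge rejected what the human passed
    n = len(shared)
    return {
        "n": n,
        "agreement_rate": agree / n if n else None,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
    }


# Illustrative labels keyed by trace ID.
human = {"tr-1": "relevant", "tr-2": "irrelevant", "tr-3": "relevant", "tr-4": "irrelevant"}
judge = {"tr-1": "relevant", "tr-2": "relevant", "tr-3": "irrelevant", "tr-4": "irrelevant"}
report = judge_agreement(human, judge)
```

A skew toward false positives suggests the judge is too lenient, while a skew toward false negatives suggests it is too strict; either pattern points at the scorer prompt before the threshold.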

Key insight: Human feedback is the calibration standard for automated scorers—schedule it like you schedule regression tests, not as a rare audit.

Conversation evaluation runs

Two options:
Pass conversation-shaped inputs (e.g. messages) and a predict_fn that consumes the full history.
Supply traces that already contain multi-turn spans and use answer-sheet style scoring where applicable.

Details and examples: Evaluate conversations.

Feedback logging on traces

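import mlflow


def log_tags_on_traces(trace_ids, tags_per_id: list[dict]):
    """Apply each tag dict to its trace, skipping rows that produced no trace."""
    for tid, tags in zip(trace_ids, tags_per_id):
        if tid is None:
            continue
        for key, value in tags.items():
            # Trace tags are set one key at a time; values are stored as strings.
            mlflow.set_trace_tag(tid, key, str(value))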

Use stable run/session tags (e.g. eval.session_id) so traces remain searchable after partial failures.

Run naming convention

Use structured, unique run names for experiments and dashboards:

import datetime

app_name = "my-agent"
version = "1.4.2"
# datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp.
run_name = f"{app_name}-eval-v{version}-{datetime.datetime.now(datetime.timezone.utc):%Y%m%d_%H%M%S}"

Include iteration or git SHA when comparing CI runs.
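A helper that appends the iteration and a short git SHA might look like the sketch below. build_run_name and its parameters are hypothetical, the suffix order is a free choice, and the SHA is assumed to be passed in (for example from a CI environment variable) rather than read from the repository.

```python
import datetime


def build_run_name(app_name, version, git_sha=None, iteration=None, now=None):
    """Compose a structured run name with optional iteration and short git SHA."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    parts = [f"{app_name}-eval-v{version}", f"{now:%Y%m%d_%H%M%S}"]
    if iteration is not None:
        parts.append(f"iter{iteration}")
    if git_sha:
        parts.append(git_sha[:7])  # short SHA keeps names readable in dashboards
    return "-".join(parts)


# A fixed timestamp makes the example deterministic.
fixed_now = datetime.datetime(2026, 1, 2, 3, 4, 5, tzinfo=datetime.timezone.utc)
run_name = build_run_name("my-agent", "1.4.2", git_sha="abc1234def", iteration=3, now=fixed_now)
```

Injecting the timestamp keeps the function testable and lets a CI job reuse one timestamp across several related runs.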

DO / DON'T

Topic	DO	DON'T
predict_fn input	Match `inputs` keys to what you read inside `predict_fn`	Treat the full row as the first argument if MLflow passes only `inputs`
Scales	Convert 0–1 vs 0–100 consistently before gates	Compare MLflow `/mean` columns to 0–100 thresholds without conversion
Retries	Wrap `evaluate()` with backoff and retryable detection	Single unguarded `evaluate()` in flaky environments
Answer sheet	Omit `predict_fn` when `outputs` is pre-filled	Require a live model when you only want to re-score
Traces	Check for `None` before tagging a trace	Assume every row produced a trace
Human feedback	Define schemas + `create_labeling_session` + `sync()`	Rely only on ad-hoc spreadsheets disconnected from traces

Common mistakes

Mistake	Consequence
Mismatched `inputs` keys and `predict_fn`	Scorer or predictor errors, empty responses
Ignoring `None` traces	Failed tagging, misleading trace counts
Wrong score scale at gate	False pass/fail
No retry on transient errors	Noisy CI, aborted batches
Skipping human calibration	Drift between judges and user-perceived quality

Validation checklist

Dataset rows include inputs (and outputs / expectations when using answer sheet mode).
predict_fn return dict keys match scorer contracts (e.g. response, retrieved_context).
Correctness reads expectations["expected_response"] — field name verified, no expected_signal aliasing.
Trace destination set if using UC-backed review or labeling.
Thresholds and metric columns use a single agreed scale per check.
Retry/backoff in place for production or CI evaluation jobs.
eval_result.metrics, eval_table, and traces inspected for the metrics you gate on.
failure_shape_classification emitted with primary_shape, failing_scorers_if_regressed, l1_failures, and failing_trace_ids.
Iteration routing follows the failure-shape table: instruction (no L1) -> 08b; tool_call_empty -> 06 direct trace debug; retrieval -> retrieval tuning; scorer_calibration -> Skill 03; safety_classifier -> endpoint audit.
Eval telemetry payload logged: failing_trace_ids, safety_buffer, predict_fn_exception_count, predict_fn_sentinel_count_per_run, judges_with_silent_aggregation_dropouts, mlflow_eval_predict_fn_signature, mlflow_eval_known_quality_issues.
judges_with_silent_aggregation_dropouts is empty before promoting.
Optional: labeling schemas and session.sync() for human-in-the-loop calibration.
If labeling session used: session.add_traces(...) was passed a pandas DataFrame (not a list of strings).
If labeling session used: ran session.sync(to_dataset=...) and then merge_records_from_session(...) to capture log_feedback labels.
Labeling telemetry captured: human_label_count, synthesized_label_count, comment_text on disagreements, time-to-label distribution.

References

Official Databricks documentation

Eval harness: https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/concepts/eval-harness
Evaluation runs: https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/concepts/evaluation-runs
Evaluate conversations: https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/evaluate-conversations
Human feedback: https://docs.databricks.com/aws/en/mlflow3/genai/human-feedback/
Label existing traces: https://docs.databricks.com/aws/en/mlflow3/genai/human-feedback/expert-feedback/label-existing-traces

Related skills (same repo)

Load on demand: references/evaluation-flow.md if you need the full evaluation execution sequence; references/gate-checking.md if threshold gating fails unexpectedly; references/repeatability.md if you need to measure score variance across runs.

Version History

Version	Date	Changes
4.1.0	2026-04-26	Added failure-shape router (instruction / tool_call_empty / retrieval / scorer_calibration / safety_classifier) with hard rule against routing L1 failures to Skill 08b. Added normative eval telemetry contract (`failing_trace_ids`, `safety_buffer`, `predict_fn_exception_count`, `predict_fn_sentinel_count_per_run`, `judges_with_silent_aggregation_dropouts`, `mlflow_eval_predict_fn_signature`, `mlflow_eval_known_quality_issues`). Documented `LabelingSession.add_traces` DataFrame contract, the `session.sync` / `log_feedback` merge gap, and the `merge_records_from_session(session_id, dataset)` helper. Added `Correctness` -> `expected_response` field-name normative section. Added labeling-session quality gates (`human_label_count`, `synthesized_label_count`, `comment_text`, time-to-label distribution). Closes retrospective bugs: `LabelingSession.add_traces requires DataFrame`, `session.sync does not merge log_feedback labels`, `Correctness uses expected_response` (Skill 04 row).
4.0.0	2026-04-10	De-coupled from repo-specific code. Added answer sheet evaluation, human feedback/labeling sessions, conversation evaluation. Grounded in official eval-harness, evaluation-runs, and human-feedback docs.
3.1.0	2026-03-27	Added reference files; scripts section; expanded DO/DON'T and references
3.0.0	2026-03-25	Initial structured skill