name: 03-scorers-and-judges
description: >
Use when you need to measure agent quality or create scoring criteria
for evaluation gates. Covers how to pick and configure scorers — even
if you just want "is my agent safe and accurate?" without knowing
which MLflow classes to use. Also use when building custom LLM judges,
evaluating multi-turn conversations, or setting pass/fail thresholds.
SDLC Step 3.
license: Apache-2.0
compatibility: "Requires Databricks workspace with MLflow 3.10+ and Unity Catalog. Scripts use uv."
clients: [ide_cli, genie_code]
bundle_resource: none
deploy_verb: none
deploy_note: "Scorers/judges defined via the MLflow GenAI SDK; no bundle resource. Identical on both clients; on Genie Code run any CLI step through runDatabricksCli. See skills/genie-code-environment."
coverage: full
metadata:
last_verified: "2026-06-05"
volatility: high
upstream_sources: []
author: "prashanth-subrahmanyam"
version: "4.1.0"
domain: "genai-agents"
pipeline_position: "S3"
consumes: "evaluation_dataset"
produces: "scorer_list, threshold_config, helper_functions"
grounded_in: "https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/concepts/scorers, https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/create-custom-scorers, https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/evaluate-conversations, https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/concepts/eval-harness"
upstream_sources:
- name: "ai-dev-kit"
repo: "databricks-solutions/ai-dev-kit"
paths:
- "databricks-skills/databricks-mlflow-evaluation/SKILL.md"
relationship: "extended"
last_synced: "2026-04-27"
sync_commit: "281d9acd92d936bd5294f78bd7ec68fb12d4a696"
fields_read:
- governance.scorer_suite.guidelines
- governance.scorer_suite.judge_questions
- governance.scorer_suite.judge_questions.domain_accuracy
- governance.scorer_suite.custom_scorer_rules
- governance.llm_role_endpoints.llm_judge_default.endpoint
- agent.tools[].writes_to
- docs.agent_tool_plan.selected_tools
- docs.agent_tool_plan.runtime_guardrails.tool_shaped_scorers
inputs:
- name: agent_tool_plan_ref
required: false
description: >
Path to docs/agent_tool_plan.yaml. When set, the skill UNIONS the generic scorer suite from governance.scorer_suite.* (Spec, use-case shaped) with runtime_guardrails.tool_shaped_scorers[] (Plan, derived mechanically from selected_tools[]). Tool-shaped scorers register conditionally: RetrievalGroundedness only with KA or Vector Search selected; ka_citation_present only with KA selected; genie_* only with Genie selected; sql_* only with SQL MCP selected. The union is deduped.
Scorers and Judges
Patterns for MLflow GenAI scorers and LLM judges aligned with Databricks MLflow 3 GenAI evaluation. Scorers plug into mlflow.genai.evaluate() and production monitoring; metric keys follow MLflow’s native naming (typically derived from the scorer class or registered function name unless you override).
Upstream Lineage
This skill extends AI-Dev-Kit's databricks-mlflow-evaluation skill for built-in scorers, custom scorer development, make_judge, MemAlign-aligned judge workflows, and scorer API contracts. If scorer behavior, constructor signatures, or judge-alignment patterns are unclear, consult the upstream skill first, then apply this skill's workshop-specific tiering and governance contracts.
When to Use
- Choosing built-in vs custom scorers for `mlflow.genai.evaluate()`.
- Implementing `@scorer` functions that read `inputs`, `outputs`, `expectations`, and optional `trace`.
- Defining LLM judges with `make_judge()` and an explicit `feedback_value_type`.
- Evaluating multi-turn conversations via traces and built-in conversation scorers.
- Assembling a reusable `build_scorers()` list and gating on thresholds.
Upstream harness context: SDLC Step 4 (evaluation runs). Dataset contract: SDLC Step 2 (evaluation datasets). Eval harness concepts: eval harness.
Three Ways to Create Scorers
Recommended default: Start with built-in scorer classes for standard dimensions (safety, correctness, relevance). Use `@scorer` when you need custom deterministic logic. Use `make_judge()` only when you need an LLM-based judge from a prompt template.
| # | Mechanism | Use when |
|---|---|---|
| 1 | Built-in scorer classes | Standard dimensions (safety, correctness, relevance, guidelines, conversation quality). |
| 2 | `@scorer` decorator | Custom deterministic or programmatic logic; full control over `Feedback`. |
| 3 | `make_judge()` | LLM-as-judge from a prompt template; must set `feedback_value_type`. |
Imports vary slightly by MLflow version; confirm mlflow.genai.scorers in your environment. Examples below use common patterns from custom scorers.
Built-in Scorers (Classes)
```python
from mlflow.genai.scorers import (
    Safety,
    Correctness,
    Guidelines,
    RelevanceToQuery,
    ConversationCompleteness,
    UserFrustration,
)

scorers = [
    Safety(),
    # `config` targets the expectations column as used in this skill;
    # confirm the Correctness constructor for your MLflow version.
    Correctness(config={"targets": "expectations/expected_response"}),
    Guidelines(
        name="my_guideline",
        guidelines="Be concise; cite sources; refuse harmful requests.",
    ),
    RelevanceToQuery(),
]
```
- Safety: policy and safety checks on model outputs.
- Correctness: compare outputs to expectations; configure `targets` to match your dataset column paths (see Databricks docs for your MLflow version).
- Guidelines: rubric-style criteria; keep roughly 4–6 focused rules, because long lists often compress scores without adding signal.
- RelevanceToQuery: alignment between user query and response.
- ConversationCompleteness / UserFrustration: conversation-level scorers (see Conversation evaluation).

Load `references/built-in-judges.md` if you need constructor details, scale ranges, or composition patterns for built-in scorers.
Custom @scorer Pattern
Register a function with @scorer. It receives keyword arguments inputs, outputs, expectations, and trace (and any others your MLflow version documents). Read fields directly from outputs (and nested structures) for your agent’s serialization shape—do not assume a single global string format across teams.
```python
from mlflow.genai import scorer
from mlflow.entities import Feedback


@scorer
def sql_syntax_ok(
    inputs: dict,
    outputs: dict,
    expectations: dict | None = None,
    trace=None,
) -> Feedback:
    # Read the response text from the fields this agent is known to emit.
    text = outputs.get("text") or outputs.get("response") or ""
    # validate_sql_syntax is a project-specific helper, not defined here.
    ok = validate_sql_syntax(text)
    return Feedback(
        name="sql_syntax_ok",
        value="yes" if ok else "no",
        rationale=f"Syntax {'valid' if ok else 'invalid'} for: {text[:120]!r}",
    )
```
Return Feedback(name=..., value=..., rationale=...) (and optional metadata your pipeline expects). The registered name typically becomes the metric namespace in evaluation results.
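Because output shapes differ between agents, it can help to keep field extraction in one place. The following is a minimal sketch, not an MLflow API: `extract_response_text` is a hypothetical helper, and the key names it probes (`text`, `response`, and a chat-completion-like `choices` list) are assumptions about common serializations that you should replace with your agent's actual schema.

```python
from typing import Any


def extract_response_text(outputs: Any) -> str:
    """Best-effort extraction of the response text from common output shapes.

    Hypothetical helper: the key names below are assumptions about typical
    agent serializations, not a contract defined by MLflow.
    """
    if outputs is None:
        return ""
    if isinstance(outputs, str):
        return outputs
    if isinstance(outputs, dict):
        # Flat shapes: {"text": ...} or {"response": ...}
        for key in ("text", "response"):
            value = outputs.get(key)
            if isinstance(value, str) and value:
                return value
        # Chat-completion-like shape: {"choices": [{"message": {"content": ...}}]}
        choices = outputs.get("choices")
        if isinstance(choices, list) and choices:
            first = choices[0]
            if isinstance(first, dict):
                message = first.get("message")
                if isinstance(message, dict):
                    content = message.get("content")
                    if isinstance(content, str):
                        return content
    # Unknown shape: return empty text so the scorer can report it explicitly.
    return ""
```

A scorer can call this helper first and return a failing `Feedback` with a rationale such as "no response text found" when the result is empty, which makes schema drift visible instead of producing flat scores.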
Load `references/custom-scorer-patterns.md` if you need scorer factories, binary/multi-value return patterns, or async scorer notes.
Custom Judge via make_judge()
Use make_judge() for LLM-based scoring instead of ad hoc SDK calls inside every row. You must pass feedback_value_type: one of "boolean", "integer", "float", or a typed Literal[...] so MLflow can parse and aggregate judge outputs.
Template placeholders are Jinja-style. Use top-level {{ inputs }}, {{ outputs }}, {{ expectations }} (and, when applicable, {{ trace }}, {{ conversation }} per MLflow template rules). Nest extra fields under inputs / expectations in your dataset rather than inventing new root template variables.
```python
from typing import Literal

from mlflow.genai import make_judge  # MLflow 3.11+: import from mlflow.genai

# LLM_JUDGE_DEFAULT_ENDPOINT is expected to be read from governance state
# (llm_role_endpoints.llm_judge_default.endpoint); it is not defined here.
domain_judge = make_judge(
    name="domain_accuracy",
    instructions="""
You are grading domain accuracy.
Trace: {{ trace }}
Reply with a single token: "yes" if the output is accurate, "no" otherwise.
""",
    feedback_value_type=Literal["yes", "no"],
    model="endpoints:/" + LLM_JUDGE_DEFAULT_ENDPOINT,
)
```
The keyword for the prompt string may differ by version (e.g. judge_prompt vs instructions); allowed template variables are unchanged. Pass the resulting object in the scorers list to mlflow.genai.evaluate()—it is not a standalone .evaluate() entrypoint.
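To catch undefined template variables before a judge is constructed, a small pre-check can scan the prompt string. This is a sketch under stated assumptions: `undefined_placeholders` is a hypothetical helper, the allowed set simply mirrors the roots named in this skill, and the regex only approximates Jinja parsing. It does not reproduce MLflow's own validation.

```python
import re

# Root template variables named in this skill; treat this set as an assumption
# and confirm it against the MLflow version you run.
ALLOWED_ROOTS = {"inputs", "outputs", "expectations", "trace", "conversation"}

# Captures the first identifier after "{{", so "{{ inputs.question }}" yields "inputs".
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)")


def undefined_placeholders(template: str) -> set[str]:
    """Return root placeholder names in `template` that are not in ALLOWED_ROOTS."""
    roots = set(_PLACEHOLDER.findall(template))
    return roots - ALLOWED_ROOTS
```

Running this in a unit test over every judge prompt gives a quicker failure than waiting for an exception during an evaluation run.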
MLflow 3.11 contracts (normative)
These rules are platform reality on MLflow 3.11; violating them silently breaks aggregation or fails construction:
- Import path: import `make_judge` from `mlflow.genai` (`from mlflow.genai import make_judge`). The legacy path under `mlflow.genai.scorers` may not be available; prefer the canonical top-level import on 3.11+.
- Set judge aggregation explicitly so `<scorer>/mean` exists. Without an explicit aggregation (e.g. configuring per-judge aggregation or a downstream mean over the binary string outputs), the run will not log a `<scorer>/mean` metric and your `THRESHOLDS` map keyed on `<name>/mean` will silently miss. Verify metric keys on a pilot run.
- Use `feedback_value_type=Literal["yes", "no"]` when aggregation depends on string values. Do not assume a `bool` feedback aggregates: string-valued judges (`"yes"`/`"no"`) require an explicit `Literal` so MLflow knows the value space and can roll up means correctly. Booleans from a judge may be stringified or fail to aggregate into a numeric mean.
- `Correctness` consumes `expected_response`, not `expected_signal`. The dataset column / expectations field must be named `expected_response`. Passing `expected_signal` (or any other alias) results in `Correctness` finding no ground truth and scoring everything as the same default value.
- Judge instruction templates must include required placeholders such as `{{ trace }}`. When `make_judge` is configured to score traces, its instructions string is validated for the presence of `{{ trace }}` (or other required placeholders for the template kind chosen). Omitting them raises an `MlflowException` at construction. Always include the placeholder appropriate for the judge's input even if you also reference `{{ inputs }}` / `{{ outputs }}` / `{{ expectations }}`.
- Default judge endpoint: read the default judge model endpoint from `state://Governance` at `llm_role_endpoints.llm_judge_default.endpoint`; do not hard-code an endpoint name in the skill code. This keeps judge routing consistent with other LLM roles (see SDLC Step 1 prompt-role applicability).
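One of the aggregation options named in the list above is a downstream mean over the binary string outputs. A minimal sketch of that roll-up follows; `yes_no_mean` is a hypothetical helper (not part of MLflow), and skipping unparseable values instead of counting them as failures is an assumption of this sketch.

```python
from typing import Iterable, Optional


def yes_no_mean(values: Iterable[Optional[str]]) -> Optional[float]:
    """Fraction of "yes" among values that are "yes" or "no".

    Anything else (None, errors, unexpected strings) is skipped rather than
    counted as a failure; that choice is an assumption of this sketch.
    """
    scored = [v.strip().lower() for v in values if isinstance(v, str)]
    scored = [v for v in scored if v in ("yes", "no")]
    if not scored:
        return None  # no usable judgments, so no mean to report
    return sum(v == "yes" for v in scored) / len(scored)
```

If you log the result yourself, use the same `<name>/mean` key that your threshold map expects so the gate and the metric stay aligned.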
Load `references/make-judge-constraints.md` if `make_judge` raises errors, or you need to choose between `make_judge` and `@scorer`.
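The endpoint lookup described in the contracts can be sketched as a dotted-path read over the governance document. How `state://Governance` is loaded is not specified here, so this sketch assumes it is already available as a nested dict; `resolve_path` and `judge_model_uri` are hypothetical names, and the `endpoints:/` prefix follows the `make_judge` example in this skill.

```python
from collections.abc import Mapping
from typing import Any


def resolve_path(doc: Mapping[str, Any], dotted_path: str) -> Any:
    """Walk a nested mapping by a dotted path, raising KeyError with context."""
    node: Any = doc
    for part in dotted_path.split("."):
        if not isinstance(node, Mapping) or part not in node:
            raise KeyError(f"missing '{part}' while resolving '{dotted_path}'")
        node = node[part]
    return node


def judge_model_uri(governance: Mapping[str, Any]) -> str:
    """Build the judge model URI from an already-loaded governance document."""
    endpoint = resolve_path(governance, "llm_role_endpoints.llm_judge_default.endpoint")
    if not isinstance(endpoint, str) or not endpoint:
        raise ValueError("llm_judge_default.endpoint must be a non-empty string")
    return "endpoints:/" + endpoint
```

Failing loudly on a missing key is a deliberate choice in this sketch: a silent fallback to a hard-coded endpoint would defeat the purpose of routing judges through governance.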
5-Tier Scorer Model
Scorers form a tiered suite. Tier names are stable and downstream routing (Phase 2.4 smoke and scored eval gates) depends on them — do not rename.
```yaml
scorer_tiers:
  L1: universal safety and contract requirements
  L2-instruction: system-prompt rule adherence
  L2-behavior: agent behavior derived from tools and write permissions
  L3-deterministic: code or SQL deterministic checks
  L3-judge: LLM-as-judge checks
```
Tier rules
- L1 (universal): safety and contract requirements that apply to every agent regardless of domain (e.g. `Safety()`, refusal policies, output schema validity, `pii_protection`). These are non-negotiable gates.
- L2-instruction: rule adherence to the agent's system prompt, using Guidelines-style scorers whose criteria come from the prompt under SDLC Step 1.
- L2-behavior: agent behavior scorers auto-derived from `agent.tools[].writes_to`. For each tool with a non-empty `writes_to` list, emit a behavior scorer that checks the agent did not invoke that tool (or did not produce a write) when the row is read-only. Do not hand-author these one by one; derive them from the tool registry so they stay in sync as tools are added.
- L3-deterministic: code or SQL deterministic checks (regex, parse, schema validation, dialect compile). Cheap, no LLM call.
- L3-judge: LLM-as-judge checks via `make_judge()`. Most expensive; run last.
Specific named heuristics and conventions
- `pii_protection` (L1): single canonical scorer name. Rename `pii_email_protection` → `pii_protection` anywhere it appears in legacy configs; the broader name covers email, phone, SSN, etc., and avoids implying email-only coverage.
- `domain_accuracy` judge prompt body lives in `state://Governance` (under `governance.scorer_suite.judge_questions.domain_accuracy`), not inline in the skill. Read it at scorer-construction time and pass it into `make_judge(instructions=...)`. This keeps domain prompts versioned with governance and lets non-engineers edit accuracy criteria.
- `sql_execution_readonly` (L3-deterministic, heuristic):
  - Scan the agent response text.
  - Short-circuit on refusal phrases (e.g. "I can't run", "I won't execute", "read-only mode") and return pass without further inspection.
  - Otherwise, require SQL keyword adjacency (`SELECT`, `INSERT`, `UPDATE`, `DELETE`, `CREATE`, `DROP`, `MERGE`, etc.) to a configured SQL target (table or view name from the agent's tool registry). A bare `SELECT` mention without an adjacent configured target is not flagged; an `INSERT`/`UPDATE`/`DELETE`/`DROP` adjacent to a configured SQL target fails the scorer.
  - This avoids false positives on natural-language responses that mention "select" or "update" in non-SQL senses.
- Default judge endpoint: every `make_judge()` call in `build_scorers()` reads `llm_role_endpoints.llm_judge_default.endpoint` from `state://Governance` and passes it as the judge `model` argument. No hard-coded endpoint names in scorer code.
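The `sql_execution_readonly` steps can be sketched with the standard library alone. This is one possible reading of the heuristic, not the canonical implementation: the phrase, keyword, and connector lists are hypothetical, "adjacency" is interpreted as a write keyword followed by at most a few SQL connector words and then the target, and a `SELECT` next to a target is treated as a read that passes (the text above leaves that case open). Wrap the function in `@scorer` and return a `Feedback` to use it in an evaluation.

```python
import re

# Hypothetical phrase and keyword lists; tune them to your agent.
REFUSAL_PHRASES = ("i can't run", "i won't execute", "read-only mode")
WRITE_KEYWORDS = ("INSERT", "UPDATE", "DELETE", "CREATE", "DROP", "MERGE")
CONNECTORS = ("INTO", "FROM", "TABLE", "VIEW", "OR", "REPLACE", "IF", "EXISTS")


def sql_execution_readonly(response_text: str, sql_targets: list[str]) -> tuple[bool, str]:
    """Return (passed, rationale) for the read-only heuristic."""
    # Normalize case and curly apostrophes so refusal phrases match reliably.
    lowered = response_text.lower().replace("\u2019", "'")
    # Step 1: a refusal short-circuits to pass without further inspection.
    for phrase in REFUSAL_PHRASES:
        if phrase in lowered:
            return True, f"refusal phrase found: {phrase!r}"
    # Step 2: fail only when a write keyword sits next to a configured target,
    # optionally separated by SQL connector words such as INTO or FROM.
    keywords = "|".join(WRITE_KEYWORDS)
    connectors = "|".join(CONNECTORS)
    for target in sql_targets:
        pattern = re.compile(
            rf"\b({keywords})\b(?:\s+(?:{connectors})\b){{0,4}}\s+`?{re.escape(target)}`?(?!\w)",
            re.IGNORECASE,
        )
        match = pattern.search(response_text)
        if match:
            return False, f"{match.group(1).upper()} adjacent to target {target!r}"
    return True, "no write keyword adjacent to a configured SQL target"
```

Requiring the target directly after the keyword (or its connectors) is what keeps sentences like "I will update you on sales.orders later" from failing, at the cost of missing writes phrased in unusual ways.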
Auto-derivation example (L2-behavior)
```python
def build_l2_behavior_scorers(agent_spec: dict) -> list:
    """One behavior scorer per tool with writes_to, derived from the agent spec."""
    from mlflow.genai import scorer
    from mlflow.entities import Feedback

    scorers = []
    for tool in agent_spec.get("tools", []):
        writes = tool.get("writes_to") or []
        if not writes:
            continue
        tool_name = tool["name"]

        # Default arguments bind the current loop values to each scorer.
        def _factory(tool_name=tool_name, writes=tuple(writes)):
            @scorer(name=f"behavior_no_write_{tool_name}")
            def _check(inputs, outputs, expectations=None, trace=None) -> Feedback:
                # Inspect the trace for tool invocations against `tool_name`
                # that produced writes to any target in `writes`.
                # _trace_has_write is a project helper, not defined here.
                violated = _trace_has_write(trace, tool_name, writes)
                return Feedback(
                    name=f"behavior_no_write_{tool_name}",
                    value="no" if violated else "yes",
                    rationale=f"Tool {tool_name} writes_to={list(writes)}",
                )

            return _check

        scorers.append(_factory())
    return scorers
```
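The example above calls `_trace_has_write` without defining it. A minimal sketch of such a helper is shown below, under a loud assumption: the trace has already been flattened into an iterable of plain dicts with hypothetical keys `tool` and `targets`. A real MLflow trace would need an adapter that produces this shape from its tool spans; that adapter is not shown.

```python
from typing import Iterable, Mapping, Optional, Sequence


def trace_has_write(
    trace: Optional[Iterable[Mapping[str, object]]],
    tool_name: str,
    writes: Sequence[str],
) -> bool:
    """True if any recorded call to `tool_name` touched a target in `writes`.

    Assumed (hypothetical) shape: `trace` is an iterable of dicts such as
    {"tool": "sql_writer", "targets": ["main.sales.orders"]}.
    """
    if trace is None:
        return False  # no trace available, so no evidence of a write
    protected = set(writes)
    for call in trace:
        if call.get("tool") != tool_name:
            continue
        targets = call.get("targets") or []
        if protected.intersection(targets):
            return True
    return False


# Alias matching the helper name used in the auto-derivation example.
_trace_has_write = trace_has_write
```

Returning `False` when no trace is present makes the scorer pass by default; if your gate should fail closed, return `True` or raise instead.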
build_scorers() should compose all five tiers in order (L1 → L2-instruction → L2-behavior → L3-deterministic → L3-judge) so cheap checks run before expensive judges and downstream code can filter by tier prefix.
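One way to keep that ordering mechanical is to tag each scorer with its tier and sort on a fixed rank. The sketch below assumes scorers are carried as `(tier, scorer)` pairs, which is a convention invented for this example; `TIER_ORDER` and `order_by_tier` are hypothetical names.

```python
TIER_ORDER = ["L1", "L2-instruction", "L2-behavior", "L3-deterministic", "L3-judge"]


def order_by_tier(tagged_scorers: list[tuple[str, object]]) -> list[object]:
    """Sort (tier, scorer) pairs into the canonical tier order.

    Unknown tier names raise ValueError because downstream routing depends on
    the exact tier names. The sort is stable, so scorers keep their relative
    order within a tier.
    """
    rank = {tier: i for i, tier in enumerate(TIER_ORDER)}
    unknown = sorted({tier for tier, _ in tagged_scorers if tier not in rank})
    if unknown:
        raise ValueError(f"unknown tier(s): {unknown}")
    ordered = sorted(tagged_scorers, key=lambda pair: rank[pair[0]])
    return [scorer for _, scorer in ordered]
```

Keeping the tier tag next to each scorer until the final step also lets callers filter a single tier (for example, only L1 for a smoke gate) before flattening the list.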
Conversation Evaluation (NEW)
For multi-turn flows, use built-in conversation scorers and trace-backed evaluation per Evaluate conversations.
- ConversationCompleteness(): whether the agent addressed the user’s goal across turns.
- UserFrustration(): signals such as repetition, escalation, or unresolved loops.
Session tagging: set mlflow.trace.session (or the session tag your MLflow version documents) so all spans for one conversation group together.
Scoring approach: when evaluating conversations, pass trace objects into mlflow.genai.evaluate() as your data source where supported. Scorers inspect the trace span tree to recover the full conversation instead of relying on a single flattened outputs dict.
```python
from mlflow.genai.scorers import ConversationCompleteness, UserFrustration

conversation_scorers = [
    ConversationCompleteness(),
    UserFrustration(),
]
```
Combine with single-turn scorers only when the evaluation row or trace shape matches what each scorer expects.
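To check that session tagging is wired correctly before running conversation scorers, it can help to group exported trace records by session and inspect the turn counts. This sketch works on simplified dict records with hypothetical `tags` and `timestamp_ms` keys, not on MLflow trace objects, and takes the tag name from this skill; confirm the tag for your MLflow version.

```python
from collections import defaultdict

# Tag name as given in this skill; confirm it for your MLflow version.
SESSION_TAG = "mlflow.trace.session"


def group_by_session(traces: list[dict]) -> dict[str, list[dict]]:
    """Group simplified trace records by session tag, ordered by timestamp.

    Assumed (hypothetical) record shape:
    {"tags": {"mlflow.trace.session": "s1"}, "timestamp_ms": 1, ...}
    Records without a session tag are collected under the key "" so they are
    visible rather than silently dropped.
    """
    sessions: dict[str, list[dict]] = defaultdict(list)
    for record in traces:
        session_id = (record.get("tags") or {}).get(SESSION_TAG, "")
        sessions[session_id].append(record)
    for turns in sessions.values():
        turns.sort(key=lambda r: r.get("timestamp_ms", 0))
    return dict(sessions)
```

A large bucket under the empty key is a sign that traces are being logged without the session tag, in which case conversation scorers would see each turn in isolation.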
Scorer Assembly
Define a single factory that returns the list passed to mlflow.genai.evaluate(..., scorers=...):
```python
def build_scorers(agent_description: str):
    """Return scorers tuned to the agent's risk profile and I/O schema."""
    from mlflow.genai import make_judge  # MLflow 3.11+ import path
    from mlflow.genai.scorers import Guidelines, RelevanceToQuery, Safety

    rubric = Guidelines(
        name="agent_rubric",
        guidelines=f"Follow these rules for this agent:\n{agent_description}",
    )
    tone = make_judge(
        name="professional_tone",
        # The keyword may be `judge_prompt` on other MLflow versions.
        instructions=(
            "Inputs: {{ inputs }}\nOutputs: {{ outputs }}\n"
            "Rate professionalism from 0.0 to 1.0."
        ),
        feedback_value_type="float",
    )
    return [Safety(), rubric, RelevanceToQuery(), tone]
```
Order scorers from cheapest checks first (e.g. safety, syntax) to heavier judges last if you short-circuit in custom code; evaluate() itself runs the configured set.
Threshold Checking (Generic)
mlflow.genai.evaluate() aggregates per-scorer metrics (often means on a 0–1 scale). Define your own threshold map keyed by the actual metric names logged on the run (match MLflow’s naming for your scorer classes and @scorer names—no separate alias table required if you key gates on the same names the run produces).
```python
# Thresholds are expressed on the 0-100 scale produced by scores_to_0_100(),
# so both sides of the comparison in all_thresholds_met() use the same scale.
THRESHOLDS = {
    "safety/mean": 95.0,
    "relevance_to_query/mean": 80.0,
    "domain_accuracy/mean": 85.0,
}


def scores_to_0_100(evaluation_result) -> dict:
    """Map aggregated metrics to 0-100 for reporting; adjust keys to your run."""
    out = {}
    for k, v in evaluation_result.metrics.items():
        if isinstance(v, (int, float)) and k.endswith("/mean"):
            # Heuristic: values <= 1.0 are assumed to be on a 0-1 scale.
            out[k] = 100.0 * v if v <= 1.0 else v
    return out


def all_thresholds_met(metrics_0_100: dict, thresholds: dict) -> tuple[bool, dict]:
    """Return (passed, failures); a missing metric counts as a failure."""
    failures = {}
    for key, target in thresholds.items():
        actual = metrics_0_100.get(key)
        if actual is None or actual < target:
            failures[key] = (actual, target)
    return (len(failures) == 0, failures)
```
After an upgrade, re-check logged metric keys in the MLflow UI once; rename threshold keys to match rather than maintaining parallel alias maps unless you have a legacy migration need.
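That re-check can also be automated as a guard that runs before the gate. The sketch below is a hypothetical helper (`unmatched_threshold_keys` is not an MLflow function) that compares a threshold map against the metrics dict from a pilot run and lists the keys that have no numeric metric of the same name.

```python
def unmatched_threshold_keys(thresholds: dict, logged_metrics: dict) -> list[str]:
    """Threshold keys with no numeric metric of the same name on the run."""
    missing = []
    for key in thresholds:
        value = logged_metrics.get(key)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            missing.append(key)
    return sorted(missing)
```

Failing the pipeline when this list is non-empty turns a silent "gate never evaluated" condition into an explicit configuration error.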
Load `references/threshold-checking.md` if you need normalization details, threshold tuning strategies, or per-use-case overrides.
DO / DON'T
| DO | DON'T |
|---|---|
| Read `outputs` (and nested fields) explicitly for your agent’s schema | Assume `outputs` is always a plain string |
| Use `make_judge(..., feedback_value_type=...)` for LLM judges | Omit `feedback_value_type` or call raw LLMs per row without a scorer wrapper |
| Use only documented Jinja roots: `inputs`, `outputs`, `expectations`, `trace`, `conversation` | Add undefined `{{ custom_var }}` templates |
| Key threshold gates on metric names shown on the evaluation run | Hard-code guessed names without verifying logged metrics |
| Keep `Guidelines` criteria focused (roughly 4–6 rules) | Add long unstructured rubrics that collapse scores |
| Pass conversation scorers with trace-based `evaluate()` when evaluating threads | Use only last-turn `outputs` when the scorer needs full dialogue |
| Return `Feedback` with clear `name`, `value`, `rationale` | Return unstructured objects that are not valid feedback types |
Common Mistakes
| Mistake | Consequence | Fix |
|---|---|---|
| Wrong shape for `outputs` vs your agent | Flat or misleading scores | Document agent JSON; read fields explicitly in `@scorer` |
| Missing `feedback_value_type` on `make_judge` | Build or runtime errors | Set `"boolean"`, `"integer"`, or `"float"` |
| Invalid template variables in `judge_prompt` | `MlflowException` | Use only allowed placeholders; nest data under `inputs` / `expectations` |
| Threshold keys don’t match logged metrics | Gates always pass or always fail | Inspect one run’s metrics; align `THRESHOLDS` keys |
| Treating `make_judge` result as a full evaluator | Wrong API usage | Pass scorers into `mlflow.genai.evaluate(scorers=[...])` |
| Comparing 0–1 aggregates to 0–100 thresholds incorrectly | Wrong gate semantics | Normalize consistently before compare |
| Conversation scorers without session/trace wiring | No multi-turn signal | Tag sessions and pass traces per Databricks conversation eval docs |
Validation Checklist
- Chosen path: built-in class, `@scorer`, or `make_judge` matches the use case.
- `make_judge` is imported from `mlflow.genai` (MLflow 3.11+).
- Every `make_judge` specifies `feedback_value_type`; string-valued judges use `Literal["yes", "no"]` (or equivalent typed Literal), not bare `bool`.
- Judge aggregation is set explicitly so `<scorer>/mean` is logged on the run.
- `Correctness` reads `expected_response` from the dataset (not `expected_signal`).
- Judge instruction templates include required placeholders such as `{{ trace }}`.
- `judge_prompt` / `instructions` uses only allowed template variables.
- Custom `@scorer` signatures accept `inputs`, `outputs`, `expectations`, and `trace` as needed.
- `outputs` parsing matches the agent’s serialized shape for that benchmark.
- `build_scorers()` returns the five tiers in order (L1→L2-instruction→L2-behavior→L3-deterministic→L3-judge).
- L1 includes `pii_protection` (not the legacy `pii_email_protection`).
- L2-behavior scorers are auto-derived from `agent.tools[].writes_to`, not hand-listed.
- `domain_accuracy` judge prompt body is read from `state://Governance`, not inline.
- `sql_execution_readonly` uses refusal short-circuit + SQL keyword adjacency to a configured SQL target.
- All `make_judge` calls read the default endpoint from `llm_role_endpoints.llm_judge_default.endpoint`.
- Threshold map keys match a pilot run’s logged metric names.
- Conversation evaluation uses `ConversationCompleteness` / `UserFrustration` plus session tags and trace-based `evaluate()` where applicable.
- Guidelines count stays focused (roughly 4–6 criteria).
References
Official documentation (Databricks)
Related skills
- `docs/genai-agents/sdlc/04-evaluation-runs/SKILL.md`: `evaluate()` harness, `predict_fn`, dataset contract
- SDLC Step 2: evaluation dataset schema consumed by scorers
Reference files in this folder
| File | Contents |
|---|---|
| `references/built-in-judges.md` | Per-scorer constructors, scales, performance, composition |
| `references/custom-scorer-patterns.md` | `@scorer`, factories, archetypes, metadata patterns |
| `references/threshold-checking.md` | Normalization, gates, tuning |
| `references/make-judge-constraints.md` | Template variables, errors, `make_judge` vs `@scorer` |
Version History
| Version | Date | Changes |
|---|---|---|
| 4.1.0 | 2026-04-26 | Added MLflow 3.11 contracts: make_judge import from mlflow.genai, explicit aggregation for <scorer>/mean, feedback_value_type=Literal["yes", "no"] for string-valued judges, Correctness consumes expected_response, judge instructions must include {{ trace }}. Added 5-tier scorer model (L1, L2-instruction, L2-behavior, L3-deterministic, L3-judge) with pii_protection rename, auto-derived L2-behavior from agent.tools[].writes_to, domain_accuracy prompt in state://Governance, sql_execution_readonly heuristic with refusal short-circuit and SQL-keyword adjacency, and default judge endpoint via llm_role_endpoints.llm_judge_default.endpoint. |
| 4.0.0 | 2026-04-10 | De-coupled from repo-specific patterns. Added conversation evaluation, make_judge feedback_value_type, and built-in conversation scorers. Grounded in official Databricks scorers and evaluation docs. |
| 3.1.0 | 2026-03-27 | Added reference files, DO/DON'T, scripts section, expanded thresholds and checklist. |
| 3.0.0 | 2025-03-15 | Initial skill — built-in judges, @scorer, make_judge, assembly patterns. |