name: mlflow-genai-evaluation description: > MLflow 3 GenAI evaluation patterns with LLM judges, MemAlign judge alignment, GEPA prompt optimization, custom scorers with _extract_response_text() helper, Databricks SDK for scorer LLM calls, metric aliases, threshold checking, 4-6 guidelines best practice, foundation model endpoints. Use when implementing agent evaluation pipelines, creating custom LLM judges, aligning judges with domain feedback, optimizing prompts, setting up UC trace ingestion for production monitoring, or troubleshooting evaluation errors. license: Apache-2.0 metadata: author: prashanth subrahmanyam version: "1.2.0" domain: genai-agents role: worker pipeline_stage: 9 pipeline_stage_name: genai-agents called_by: - genai-agents-setup standalone: true last_verified: "2026-02-07" volatility: high upstream_sources: - name: "ai-dev-kit" repo: "databricks-solutions/ai-dev-kit" paths: - "databricks-tools-core/databricks_tools_core/sql/sql.py" relationship: "reference" last_synced: "2026-02-19" sync_commit: "97a3637" - name: "ai-dev-kit" repo: "databricks-solutions/ai-dev-kit" paths: - "databricks-skills/databricks-mlflow-evaluation/SKILL.md" relationship: "extended" last_synced: "2026-02-19" sync_commit: "latest"
MLflow GenAI Evaluation Patterns
Production-grade patterns for evaluating Databricks GenAI agents using MLflow 3.0+ mlflow.genai.evaluate() with LLM-as-judge scorers and custom evaluation metrics.
When to Use
- Implementing agent evaluation pipelines with LLM judges
- Creating custom domain-specific evaluation scorers
- Setting up evaluation datasets for agent testing
- Checking deployment thresholds before production deployment
- Troubleshooting evaluation errors (0.0 scores, metric name mismatches)
- Optimizing guidelines for better evaluation scores
- Querying evaluation results programmatically
- Aligning LLM judges with domain expert feedback via MemAlign
- Automated prompt optimization with GEPA (
optimize_prompts()) - Setting up Unity Catalog trace ingestion for production monitoring
Upstream API Note
The upstream databricks-mlflow-evaluation skill in AI-Dev-Kit covers 8 end-to-end workflows: first-time evaluation setup, production trace-to-dataset, performance optimization, regression detection, custom scorer development, UC trace ingestion and production monitoring, judge alignment with MemAlign, and automated prompt optimization with GEPA.
Critical API facts:
- Use
mlflow.genai.evaluate()(NOTmlflow.evaluate()) - Data format:
{"inputs": {"query": "..."}}(nested structure required) - MemAlign is scorer-agnostic (works with any
feedback_value_type) - GEPA optimization dataset must have both
inputsANDexpectationsper record - Requires MLflow >= 3.5.0 for
optimize_prompts()
⚠️ CRITICAL: Response Extraction Helper
MANDATORY: _extract_response_text() must be included in ALL custom scorers.
mlflow.genai.evaluate() serializes ResponsesAgentResponse to a dict before passing to scorers. Without proper extraction, scorers receive serialized dicts and return 0.0 scores (silent failure).
Full implementation: See scripts/evaluation_helpers.py for complete function code.
Why this matters:
- ❌ Without helper: 9+ custom scorers return 0.0 for ALL responses
- ❌ Silent failure - no error messages, just 0.0 scores
- ❌ Took 5+ deployment iterations to discover root cause
- ✅ With helper: Scorers work correctly first time
⚠️ CRITICAL: Databricks SDK for LLM Calls
ALWAYS use Databricks SDK (NOT langchain_databricks) for LLM calls in custom scorers.
| Issue | langchain_databricks | Databricks SDK |
|---|---|---|
| Serverless Compute | ❌ Package install failures | ✅ No install needed |
| Authentication | ❌ Varies by environment | ✅ Automatic in notebooks |
| Deployment Jobs | ❌ Unreliable auth | ✅ Reliable auth |
| Support | ⚠️ Community package | ✅ Official Databricks SDK |
Full implementation: See scripts/evaluation_helpers.py for _call_llm_for_scoring() helper.
Guidelines Best Practice: 4-6 Sections
CRITICAL: Keep guidelines to 4-6 essential sections (NOT 8+).
❌ DON'T: Too Many Guidelines Sections
# BAD: 8 comprehensive guidelines = low scores
guidelines = [
"Section 1: Response Structure (200 words)",
"Section 2: Data Accuracy (150 words)",
"Section 3: No Fabrication (180 words)",
"Section 4: Actionability (160 words)",
"Section 5: Domain Expertise (200 words)",
"Section 6: Cross-Domain Intelligence (150 words)",
"Section 7: Professional Tone (120 words)",
"Section 8: Completeness (170 words)",
]
# Result: guidelines/mean = 0.20 (too strict!)
✅ DO: 4-6 Essential Guidelines
# GOOD: 4 focused, critical guidelines = higher, more meaningful scores
guidelines = [
"""Data Accuracy and Specificity:
- MUST include specific numbers (costs, DBUs, percentages)
- MUST include time context (when data is from)
- MUST include trend direction (increased/decreased)""",
"""No Data Fabrication (CRITICAL):
- MUST NEVER fabricate numbers
- If Genie errors, MUST state explicitly""",
"""Actionability and Recommendations:
- MUST provide specific, actionable next steps
- MUST include concrete implementation details""",
"""Professional Enterprise Tone:
- MUST maintain professional tone
- MUST use proper formatting (markdown, tables)""",
]
# Result: guidelines/mean = 0.5+ (achievable, meaningful)
Why this matters:
- 8+ sections = overly strict scoring (0.20 average)
- 4-6 sections = achievable, meaningful scores (0.50+ average)
- Focus on critical quality dimensions only
Run Naming Convention
ALWAYS use consistent run naming for querying latest evaluation results.
# ✅ CORRECT: Consistent prefix + timestamp
from datetime import datetime
run_name = f"eval_pre_deploy_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
mlflow.genai.evaluate(
model=model_uri,
data=eval_dataset,
model_type="databricks-agent",
evaluators=evaluators,
evaluator_config=evaluator_config,
experiment_name="/Shared/health_monitor_agent_evaluation",
run_name=run_name, # ✅ Consistent naming
)
# Query latest evaluation
runs = mlflow.search_runs(
filter_string="tags.mlflow.runName LIKE 'eval_pre_deploy_%'", # ✅ Predictable
order_by=["start_time DESC"],
max_results=1
)
Why this matters:
- Automated checks can find latest evaluation results
- Consistent naming enables programmatic threshold validation
- CI/CD pipelines can query recent evaluation metrics
Quick Example: @scorer Decorator Pattern
import mlflow
from mlflow.models import Score
from mlflow.metrics import scorer
from typing import Dict, Optional
# Import helpers (from scripts/evaluation_helpers.py)
from evaluation_helpers import _extract_response_text, _call_llm_for_scoring
@mlflow.trace(name="cost_accuracy_judge", span_type="JUDGE")
@scorer
def cost_accuracy_judge(
inputs: Dict,
outputs: Dict,
expectations: Optional[Dict] = None
) -> Score:
"""
Custom judge evaluating cost accuracy in agent responses.
CRITICAL: Must use _extract_response_text() helper.
"""
# ✅ STEP 1: Extract response text (MANDATORY)
response_text = _extract_response_text(outputs)
# ✅ STEP 2: Get query
query = inputs.get("request", "")
# ✅ STEP 3: Build evaluation prompt
judge_prompt = f"""Evaluate cost accuracy...
Query: {query}
Response: {response_text}
Return JSON: {{"score": 0.0-1.0, "rationale": "explanation"}}
"""
# ✅ STEP 4: Call LLM via Databricks SDK (NOT langchain_databricks)
result = _call_llm_for_scoring(judge_prompt, endpoint="databricks-claude-3-7-sonnet")
# ✅ STEP 5: Return Score object
return Score(
value=result["score"],
rationale=result["rationale"]
)
See references/custom-scorer-patterns.md for complete examples.
CRITICAL: make_judge() Template Variable Constraints
make_judge() and the MLflow Prompt Registry use the same {{ }} template syntax but have different validation rules:
| System | Allowed Variable Names | Validation |
|---|---|---|
Prompt Registry (register_prompt()) |
Any {{ variable }} |
No validation — any name accepted |
make_judge(instructions=...) |
Only 5 allowed (see below) | Strict — raises MlflowException on unknown variables |
make_judge() only permits these 5 template variables:
{{ inputs }}— the eval record'sinputsdict{{ outputs }}— thepredict_fnreturn value{{ trace }}— the MLflow trace frompredict_fn{{ expectations }}— the eval record'sexpectationsdict{{ conversation }}— conversation data (for chat models)
Bidirectional constraint:
- MUST contain at least one of the 5 allowed variables (plain text is rejected)
- MUST NOT contain any other
{{ variable }}names (custom variables are rejected)
# WRONG — custom variables crash make_judge()
"Question: {{question}}\nExpected SQL: {{expected_sql}}\nGenerated SQL: {{genie_sql}}"
# WRONG — no variables at all, also crashes make_judge()
"Evaluate the SQL quality and respond with yes or no."
# CORRECT — uses only allowed variables
"User question: {{ inputs }}\nGenerated SQL: {{ outputs }}\nExpected SQL: {{ expectations }}"
CRITICAL: predict_fn Keyword Argument Contract
mlflow.genai.evaluate() unpacks the inputs dict as keyword arguments when calling predict_fn. The function signature must match the keys in eval_records["inputs"].
# Given eval records with:
eval_records = [{"inputs": {"question": "...", "space_id": "...", "expected_sql": "..."}, ...}]
# WRONG — receives keyword args, not a dict
def predict_fn(inputs: dict) -> dict:
question = inputs["question"] # TypeError or MlflowException
# CORRECT — signature matches input keys
def predict_fn(question: str, expected_sql: str = "", **kwargs) -> dict:
# question and expected_sql are unpacked directly
# space_id, catalog, etc. land in **kwargs (use closure for these)
...
Metric Aliases Quick Reference
CRITICAL: Handle metric name variations across MLflow versions.
# Built-in scorers use different metric names across MLflow versions
METRIC_ALIASES = {
"relevance/mean": ["relevance_to_query/mean"], # MLflow 3.0 vs 3.1
"safety/mean": ["safety/mean"], # No alias needed
"guidelines/mean": ["guidelines/mean"], # No alias needed
}
Why aliases matter:
- MLflow 3.0 uses
"relevance/mean" - MLflow 3.1 uses
"relevance_to_query/mean" - Without aliases, threshold checks fail silently
- Deployment succeeds with failing scores (BAD!)
See references/threshold-checking.md for complete check_thresholds() function.
Foundation Model Endpoints Recommendation
ALWAYS use foundation model endpoints (NOT pay-per-token) for judges.
✅ Recommended Endpoints
endpoints:/databricks-claude-sonnet-4-5(recommended)endpoints:/databricks-meta-llama-3-1-405b-instructendpoints:/databricks-claude-3-7-sonnet
❌ Avoid Pay-Per-Token Endpoints
- Evaluation is high-volume
- Pay-per-token gets expensive fast
- Foundation models included in workspace DBU consumption
See references/custom-scorer-patterns.md for complete endpoint list.
Validation Checklist
Before running agent evaluation:
Dataset & Configuration
- Evaluation dataset loaded with correct schema (request, response columns)
- 4-6 essential guidelines defined (not 8+)
- Run name follows convention:
eval_pre_deploy_YYYYMMDD_HHMMSS - Evaluation experiment set correctly
Custom Scorers (CRITICAL)
- ✅
_extract_response_text()helper included in ALL custom scorers - ✅ Databricks SDK used for LLM calls (NOT
langchain_databricks) - ✅
_call_llm_for_scoring()helper defined and used - Custom judges use
@mlflow.traceand@scorerdecorators - Custom judges return
Scoreobject withvalueandrationale - Foundation model endpoints used (not pay-per-token)
- Temperature = 0.0 for judge consistency
Threshold Checking
- ✅
METRIC_ALIASESdefined for backward compatibility - ✅
check_thresholds()function used (handles aliases) - Thresholds defined for all judges
- Results checked against thresholds before deployment
Reference Files
references/custom-scorer-patterns.md- Complete custom scorer patterns with_call_llm_for_scoring()helperreferences/built-in-judges.md- Built-in judges (relevance, safety, guidelines) with thresholdsreferences/threshold-checking.md-check_thresholds()function with metric aliases supportreferences/evaluation-dataset-patterns.md- Evaluation dataset schema and loading patternsscripts/evaluation_helpers.py- Complete helper functions (_extract_response_text(),_call_llm_for_scoring(),check_thresholds())
References
Official Documentation
Related Skills
ml-pipeline-setup- MLflow model patternsresponses-agent-patterns- ResponsesAgent implementation patterns
Scorer vs Evaluator Semantics
make_judge() returns an InstructionsJudge scorer callable — an object intended for use inside mlflow.genai.evaluate(scorers=[...]). Scorers are not standalone evaluators:
- Scorers are callables that receive structured inputs from
mlflow.genai.evaluate()and returnFeedbackobjects. They have no.evaluate()method. mlflow.genai.evaluate()is the evaluator — it orchestrates the predict function, passes data through scorers, and logs results to MLflow.
For inline/conditional LLM calls outside the mlflow.genai.evaluate() harness (e.g., arbiter conditional scoring, ad-hoc quality checks), use direct LLM calls:
_call_llm_for_scoring()viaw.serving_endpoints.query()(Databricks SDK)- Parse JSON verdicts from the LLM response manually
| Use Case | Correct Approach | Wrong Approach |
|---|---|---|
| Running all judges on benchmark suite | mlflow.genai.evaluate(scorers=[judge1, judge2]) |
Calling each judge manually in a loop |
| Conditional scoring (arbiter fires only on disagreement) | _call_llm_for_scoring(prompt) inside a @scorer |
make_judge().evaluate(data) — no .evaluate() method |
| Quick ad-hoc LLM quality check | w.serving_endpoints.query(...) |
make_judge()(inputs) — wrong call signature |
Common Mistakes
| Mistake | Consequence | Fix |
|---|---|---|
Calling make_judge().evaluate() for standalone scoring |
AttributeError: 'InstructionsJudge' object has no attribute 'evaluate' |
Use _call_llm_for_scoring() for inline/conditional LLM calls, or pass scorers to mlflow.genai.evaluate(scorers=[...]) |