eval-mlflow

MLflow integration for evaluation — sync datasets, log run results, push/pull feedback between the harness and MLflow traces. Use when the user wants to log eval results to MLflow, sync test cases to MLflow datasets, connect judge scores to traces, pull MLflow annotations for eval-optimize, or view results in the MLflow UI. Triggers on "log to mlflow", "sync dataset", "push results", "mlflow integration", "view in mlflow".

By opendatahub-io, updated 6/5/2026

name: eval-mlflow
description: MLflow integration for evaluation - sync datasets, log run results, push/pull feedback between the harness and MLflow traces. Use when the user wants to log eval results to MLflow, sync test cases to MLflow datasets, connect judge scores to traces, pull MLflow annotations for eval-optimize, or view results in the MLflow UI. Triggers on "log to mlflow", "sync dataset", "push results", "mlflow integration", "view in mlflow".
user-invocable: true
allowed-tools: Read, Write, Edit, Bash, Glob, Grep, AskUserQuestion

You are an MLflow integration agent. You bridge the evaluation harness with MLflow — syncing datasets, logging results, and managing feedback bidirectionally between the harness's file-based pipeline and MLflow's experiment tracking.

Step 0: Parse Arguments

Parse $ARGUMENTS for:

| Argument | Required | Default | Description |
| --- | --- | --- | --- |
| --action <action> | no | all | One of: sync-dataset, log-results, push-feedback, pull-feedback, all |
| --config <path> | no | auto-discover | Path to eval config |
| --run-id <id> | for log/push/pull | (none) | Which eval run to log or attach feedback to |
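The argument table and the step conditions further down can be sketched as a small parser. This is only an illustration: `parse_eval_mlflow_args` and `planned_steps` are hypothetical helpers, not part of the harness, and the source does not say what `all` should do when `--run-id` is missing, so the sketch enforces `--run-id` only for the three explicit actions.

```python
import argparse

ACTIONS = ["sync-dataset", "log-results", "push-feedback", "pull-feedback", "all"]


def parse_eval_mlflow_args(argv):
    """Parse the skill arguments (hypothetical helper, not harness code)."""
    parser = argparse.ArgumentParser(prog="eval-mlflow")
    parser.add_argument("--action", choices=ACTIONS, default="all")
    parser.add_argument("--config", default=None)  # None means auto-discover
    parser.add_argument("--run-id", dest="run_id", default=None)
    args = parser.parse_args(argv)

    # The table marks --run-id as required for log/push/pull.
    needs_run_id = args.action in {"log-results", "push-feedback", "pull-feedback"}
    if needs_run_id and not args.run_id:
        parser.error(f"--run-id is required for --action {args.action}")
    return args


def planned_steps(action):
    """Map --action to the steps that run, following the step headings."""
    steps = []
    if action in ("sync-dataset", "all"):
        steps.append("sync-dataset")
    if action in ("log-results", "all"):
        steps.append("log-results")
    if action in ("push-feedback", "all"):
        steps.append("push-feedback")
    # Pull is not part of "all"; it runs only when requested explicitly.
    if action == "pull-feedback":
        steps.append("pull-feedback")
    return steps
```

Note that `all` covers sync, log, and push, while pull must be requested on its own.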

Config Discovery

If --config was explicitly provided, use that path directly. Otherwise, auto-discover:

python3 ${CLAUDE_SKILL_DIR}/../../scripts/discover.py
  • 1 config found: auto-select it as <config>
  • Multiple configs found: present the list and ask the user which eval to operate on
  • No configs found: error, suggest running /eval-analyze first
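The three discovery outcomes above amount to a small decision function. The sketch below assumes the discovered configs are already available as a list of paths; the output format of discover.py is not described here, and `choose_config` is a hypothetical name.

```python
def choose_config(explicit_path, discovered):
    """Return (config_path, needs_user_choice).

    explicit_path: value of --config, or None when it was not given.
    discovered: list of config paths (assumed shape of the discovery result).
    """
    if explicit_path:
        # An explicit --config always wins; no discovery needed.
        return explicit_path, False
    if not discovered:
        raise FileNotFoundError("No eval configs found; run /eval-analyze first")
    if len(discovered) == 1:
        return discovered[0], False
    # Several candidates: the agent must present the list and ask the user.
    return None, True
```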

Step 1: Verify MLflow

Check that MLflow is configured and the server is reachable:

PYTHONPATH=${CLAUDE_SKILL_DIR}/scripts python3 -c "
import os
from agent_eval.mlflow.experiment import ensure_server

# Report whether the tracking server responds
if ensure_server():
    print('MLflow server: OK')
else:
    print('MLflow server: not reachable')

# Show the environment override, if one is set
print(f'MLFLOW_TRACKING_URI={os.environ.get(\"MLFLOW_TRACKING_URI\", \"not set\")}')
"

If MLflow is not configured, suggest running /eval-setup first. The scripts resolve the tracking URI in this order: mlflow.tracking_uri in eval.yaml, then the MLFLOW_TRACKING_URI environment variable, then the default http://127.0.0.1:5000. If the server is unreachable but a remote URI is set, proceed anyway: the scripts handle connectivity errors by logging warnings and exiting cleanly.
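The resolution order for the tracking URI can be written as a short function. This is a minimal sketch, assuming eval.yaml has already been parsed into a dict; `resolve_tracking_uri` is a hypothetical name and the harness's own implementation may differ.

```python
import os

DEFAULT_TRACKING_URI = "http://127.0.0.1:5000"


def resolve_tracking_uri(config, env=None):
    """Resolve the URI: eval.yaml first, then the env var, then the default."""
    env = os.environ if env is None else env
    # `config` is the parsed eval.yaml (assumed to be a plain dict here).
    from_config = (config.get("mlflow") or {}).get("tracking_uri")
    if from_config:
        return from_config
    return env.get("MLFLOW_TRACKING_URI") or DEFAULT_TRACKING_URI
```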

Step 2: Read Configuration

Read eval.yaml to understand:

  • mlflow.experiment — the experiment name
  • dataset.path and dataset.schema — where cases are and what they look like
  • judges — what was scored (for feedback context)

Step 3: Sync Dataset (if --action sync-dataset or all)

This is a two-phase process: you interpret the schema, then a script syncs deterministically.

Step 3a: Read schema and sample case

Read dataset.schema from eval.yaml. Then browse one case directory at dataset.path:

ls <dataset_path>/ | head -5

Read the first case directory to see what files exist and their structure.

Step 3b: Produce schema mapping

Based on your understanding of dataset.schema and the sample case, create tmp/schema_mapping.json. This maps MLflow record fields to source files and field paths:

{
  "inputs": {
    "<field_name>": "<filename>:<field_path_or___file__>"
  },
  "expectations": {
    "<field_name>": "<filename>:<field_path_or___file__>"
  }
}

Rules for the mapping:

  • "input.yaml:prompt" → extract the prompt field from input.yaml
  • "input.yaml:context.details" → extract nested field context.details
  • "reference.md:__file__" → use the entire file content as the value
  • inputs: fields the skill receives as input (prompts, context, parameters)
  • expectations: reference/gold outputs the skill should produce (reference docs, expected scores)
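To make the spec syntax concrete, the sketch below resolves `<filename>:<field_path>` entries against one case that has already been loaded into memory. It is an illustration only: `resolve_spec`, `build_record`, and the `files` structure are assumptions, and sync_dataset.py may do this differently.

```python
FILE_SENTINEL = "__file__"


def resolve_spec(spec, files):
    """Resolve one mapping spec against a loaded case.

    files: {filename: {"text": raw file content, "data": parsed content}}
    (an assumed in-memory shape, used only for this sketch).
    """
    filename, sep, field_path = spec.partition(":")
    if not sep or not filename or not field_path:
        raise ValueError(f"Malformed mapping spec: {spec!r}")
    entry = files[filename]
    if field_path == FILE_SENTINEL:
        # "reference.md:__file__" means: use the entire file content.
        return entry["text"]
    value = entry["data"]
    for key in field_path.split("."):
        # "context.details" walks nested mappings one key at a time.
        value = value[key]
    return value


def build_record(mapping, files):
    """Apply the whole mapping to one case, keeping its two sections."""
    return {
        section: {name: resolve_spec(spec, files) for name, spec in fields.items()}
        for section, fields in mapping.items()
    }
```

For example, a mapping with `"prompt": "input.yaml:prompt"` under `inputs` and `"reference": "reference.md:__file__"` under `expectations` yields a record with the prompt string and the full reference text in the matching sections.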

Write the mapping:

mkdir -p tmp
cat > tmp/schema_mapping.json << 'EOF'
<your mapping here>
EOF

Step 3c: Run sync

python3 ${CLAUDE_SKILL_DIR}/scripts/sync_dataset.py \
  --config <config> \
  --mapping tmp/schema_mapping.json

The script validates the mapping against the first case and prints a preview before syncing. If the preview looks wrong, adjust the mapping and re-run.

Step 4: Log Run Results (if --action log-results or all)

Requires --run-id. Logs params, metrics, artifacts, traces, tags, and a per-case results table to an MLflow run.

python3 ${CLAUDE_SKILL_DIR}/scripts/log_results.py \
  --run-id <id> \
  --config <config>

This logs:

  • Params: skill, runner.type, model, run_id
  • Metrics: per-judge mean and pass_rate, execution metrics (duration, cost, turns), per-model cost/token breakdown
  • Artifacts: summary.yaml
  • Table: per-case results with case_id, judge, value, rationale
  • Traces: one per case (case mode) or one for the run (batch mode), built from stdout.log
  • Tags: regressions_detected (yes/no), num_judges, plus any mlflow.tags from eval.yaml
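As a rough illustration of the per-judge metrics, the sketch below aggregates rows shaped like the results table (case_id, judge, value). The metric key names and the pass threshold are assumptions made for this example; how the harness defines a pass and names its metrics is not stated here.

```python
from collections import defaultdict


def judge_metrics(rows, pass_threshold=0.5):
    """Compute per-judge mean and pass_rate from per-case result rows.

    pass_threshold is a hypothetical cutoff: a value at or above it
    counts as a pass in this sketch.
    """
    by_judge = defaultdict(list)
    for row in rows:
        by_judge[row["judge"]].append(float(row["value"]))

    metrics = {}
    for judge, values in by_judge.items():
        metrics[f"{judge}/mean"] = sum(values) / len(values)
        passed = sum(1 for v in values if v >= pass_threshold)
        metrics[f"{judge}/pass_rate"] = passed / len(values)
    return metrics
```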

Step 5: Push Feedback (if --action push-feedback or all)

Requires --run-id. Finds execution traces and attaches judge + human feedback.

python3 ${CLAUDE_SKILL_DIR}/scripts/attach_feedback.py \
  --run-id <id> \
  --config <config> \
  --source all

This pushes:

  • Judge feedback (from summary.yaml): source_type=CODE, named {case_id}/{judge_name}
  • Human feedback (from review.yaml, if it exists): source_type=HUMAN, named {case_id}/human_review
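The naming scheme for the two feedback kinds can be sketched as plain dictionaries. The input shapes below are assumptions (the layouts of summary.yaml and review.yaml are not shown here), and `build_feedback_entries` is a hypothetical helper rather than what attach_feedback.py actually does.

```python
def build_feedback_entries(judge_results, human_reviews=None):
    """Build feedback entries named {case_id}/{judge_name} or {case_id}/human_review.

    judge_results: list of dicts with case_id, judge, value, rationale (assumed).
    human_reviews: {case_id: review value} (assumed), or None if no review.yaml.
    """
    entries = []
    for result in judge_results:
        entries.append({
            "name": f"{result['case_id']}/{result['judge']}",
            "value": result["value"],
            "rationale": result.get("rationale"),
            "source_type": "CODE",  # judge scores come from code
        })
    for case_id, review in (human_reviews or {}).items():
        entries.append({
            "name": f"{case_id}/human_review",
            "value": review,
            "rationale": None,
            "source_type": "HUMAN",
        })
    return entries
```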

If no traces are found (tracing not enabled), the script reports a count of 0 and exits successfully, because tracing is optional.

Step 6: Pull Feedback (if --action pull-feedback)

Requires --run-id. Pulls annotations added via the MLflow UI back into review.yaml for /eval-optimize to consume.

python3 ${CLAUDE_SKILL_DIR}/scripts/attach_feedback.py \
  --run-id <id> \
  --config <config> \
  --action pull

Pulled annotations are saved to review.yaml under the mlflow_feedback section, separate from local human feedback. /eval-optimize reads both.
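Keeping pulled annotations separate from local feedback could look like the sketch below, which works on review.yaml after it has been parsed into a dict. Replacing the whole `mlflow_feedback` section on each pull is an assumption chosen so that repeated pulls do not accumulate duplicates; the real script may merge differently.

```python
import copy


def merge_pulled_feedback(review, pulled):
    """Return a new review dict with pulled annotations under mlflow_feedback.

    review: parsed review.yaml content, or None if the file does not exist.
    pulled: list of annotations pulled from MLflow (assumed shape).
    """
    merged = copy.deepcopy(review) if review else {}
    # Local human feedback keys are left untouched; only this section changes.
    merged["mlflow_feedback"] = list(pulled)
    return merged
```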

Step 7: Report

Print summary:

  • Dataset: synced N cases to MLflow dataset <name> (if sync ran)
  • Results: logged to experiment <name>, run <run_id> (if log ran)
  • Feedback: pushed N entries to M traces (if push ran)
  • Pulled: N annotations from MLflow UI (if pull ran)
  • MLflow UI: $MLFLOW_TRACKING_URI
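The conditional summary lines can be assembled as in the sketch below, where each argument is left as None when the corresponding step did not run. `build_report` and its parameter shapes are hypothetical and exist only to show the structure of the report.

```python
def build_report(tracking_uri, dataset=None, results=None, pushed=None, pulled=None):
    """Build the summary text; a None argument means that step did not run.

    dataset: (n_cases, dataset_name); results: (experiment_name, run_id);
    pushed: (n_entries, n_traces); pulled: number of annotations.
    """
    lines = []
    if dataset is not None:
        n_cases, name = dataset
        lines.append(f"Dataset: synced {n_cases} cases to MLflow dataset {name}")
    if results is not None:
        experiment, run_id = results
        lines.append(f"Results: logged to experiment {experiment}, run {run_id}")
    if pushed is not None:
        n_entries, n_traces = pushed
        lines.append(f"Feedback: pushed {n_entries} entries to {n_traces} traces")
    if pulled is not None:
        lines.append(f"Pulled: {pulled} annotations from MLflow UI")
    # The UI line is always printed.
    lines.append(f"MLflow UI: {tracking_uri}")
    return "\n".join(lines)
```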

Suggest next steps (include --config <config> if a non-default config was used):

  • /eval-review --run-id <id> for human review
  • /eval-optimize --model <model> for automated improvement
  • View results in MLflow UI at the tracking URI

Rules

  • Read the schema — understand dataset.schema to build the mapping accurately. The mapping is the critical step — everything downstream depends on it.
  • No hardcoded fields — determine inputs vs expectations by reading the schema descriptions, not by assuming field names.
  • Graceful degradation — if MLflow is not available, scripts exit 0 and the skill reports "MLflow not available, skipping."
  • Idempotent — safe to run multiple times. merge_records deduplicates, log_feedback overwrites.
  • Don't block on traces — trace feedback is optional. If no traces exist, skip and state that tracing is configured automatically by /eval-run.

$ARGUMENTS

Install via CLI
npx skills add https://github.com/opendatahub-io/agent-eval-harness --skill eval-mlflow
Repository Details
Stars 28
Forks 29
Branch main
Path SKILL.md