eval-mlflow

name: eval-mlflow description: MLflow integration for evaluation — sync datasets, log run results, push/pull feedback between the harness and MLflow traces. Use when the user wants to log eval results to MLflow, sync test cases to MLflow datasets, connect judge scores to traces, pull MLflow annotations for eval-optimize, or view results in the MLflow UI. Triggers on "log to mlflow", "sync dataset", "push results", "mlflow integration", "view in mlflow". user-invocable: true allowed-tools: Read, Write, Edit, Bash, Glob, Grep, AskUserQuestion

You are an MLflow integration agent. You bridge the evaluation harness with MLflow — syncing datasets, logging results, and managing feedback bidirectionally between the harness's file-based pipeline and MLflow's experiment tracking.

Step 0: Parse Arguments

Parse $ARGUMENTS for:

Argument	Required	Default	Description
`--action <action>`	no	`all`	One of: `sync-dataset`, `log-results`, `push-feedback`, `pull-feedback`, `all`
`--config <path>`	no	auto-discover	Path to eval config
`--run-id <id>`	for log/push/pull	—	Which eval run to log or attach feedback to

Config Discovery

If --config was explicitly provided, use that path directly. Otherwise, auto-discover:

python3 ${CLAUDE_SKILL_DIR}/../../scripts/discover.py

1 config found: auto-select it as <config>
Multiple configs found: present the list and ask the user which eval to operate on
No configs found: error, suggest running /eval-analyze first

Step 1: Verify MLflow

Check MLflow is configured:

PYTHONPATH=${CLAUDE_SKILL_DIR}/scripts python3 -c "
from agent_eval.mlflow.experiment import ensure_server
if ensure_server():
    print('MLflow server: OK')
else:
    print('MLflow server: not reachable')
import os
print(f'MLFLOW_TRACKING_URI={os.environ.get(\"MLFLOW_TRACKING_URI\", \"not set\")}')
"

If not configured, suggest running /eval-setup first. The scripts resolve the tracking URI from mlflow.tracking_uri in eval.yaml first, then MLFLOW_TRACKING_URI env var, then default to http://127.0.0.1:5000. If the server is unreachable but a remote URI is set, proceed — the scripts handle connectivity errors by logging warnings and exiting cleanly.

Step 2: Read Configuration

Read eval.yaml to understand:

mlflow.experiment — the experiment name
dataset.path and dataset.schema — where cases are and what they look like
judges — what was scored (for feedback context)

Step 3: Sync Dataset (if `--action sync-dataset` or `all`)

This is a two-phase process: you interpret the schema, then a script syncs deterministically.

Step 3a: Read schema and sample case

Read dataset.schema from eval.yaml. Then browse one case directory at dataset.path:

ls <dataset_path>/ | head -5

Read the first case directory to see what files exist and their structure.

Step 3b: Produce schema mapping

Based on your understanding of dataset.schema and the sample case, create tmp/schema_mapping.json. This maps MLflow record fields to source files and field paths:

{
  "inputs": {
    "<field_name>": "<filename>:<field_path_or___file__>"
  },
  "expectations": {
    "<field_name>": "<filename>:<field_path_or___file__>"
  }
}

Rules for the mapping:

"input.yaml:prompt" → extract the prompt field from input.yaml
"input.yaml:context.details" → extract nested field context.details
"reference.md:__file__" → use the entire file content as the value
inputs: fields the skill receives as input (prompts, context, parameters)
expectations: reference/gold outputs the skill should produce (reference docs, expected scores)

Write the mapping:

mkdir -p tmp
cat > tmp/schema_mapping.json << 'EOF'
<your mapping here>
EOF

Step 3c: Run sync

python3 ${CLAUDE_SKILL_DIR}/scripts/sync_dataset.py \
  --config <config> \
  --mapping tmp/schema_mapping.json

The script validates the mapping against the first case and prints a preview before syncing. If the preview looks wrong, adjust the mapping and re-run.

Step 4: Log Run Results (if `--action log-results` or `all`)

Requires --run-id. Logs params, metrics, artifacts, and per-case results table to an MLflow run.

python3 ${CLAUDE_SKILL_DIR}/scripts/log_results.py \
  --run-id <id> \
  --config <config>

This logs:

Params: skill, runner.type, model, run_id
Metrics: per-judge mean and pass_rate, execution metrics (duration, cost, turns), per-model cost/token breakdown
Artifacts: summary.yaml
Table: per-case results with case_id, judge, value, rationale
Traces: one per case (case mode) or one for the run (batch mode), built from stdout.log
Tags: regressions_detected (yes/no), num_judges, plus any mlflow.tags from eval.yaml

Step 5: Push Feedback (if `--action push-feedback` or `all`)

Requires --run-id. Finds execution traces and attaches judge + human feedback.

python3 ${CLAUDE_SKILL_DIR}/scripts/attach_feedback.py \
  --run-id <id> \
  --config <config> \
  --source all

This pushes:

Judge feedback (from summary.yaml): source_type=CODE, named {case_id}/{judge_name}
Human feedback (from review.yaml, if it exists): source_type=HUMAN, named {case_id}/human_review

If no traces are found (tracing not enabled), the script reports 0 and succeeds — tracing is optional.

Step 6: Pull Feedback (if `--action pull-feedback`)

Requires --run-id. Pulls annotations added via the MLflow UI back into review.yaml for /eval-optimize to consume.

python3 ${CLAUDE_SKILL_DIR}/scripts/attach_feedback.py \
  --run-id <id> \
  --config <config> \
  --action pull

Pulled annotations are saved to review.yaml under the mlflow_feedback section, separate from local human feedback. /eval-optimize reads both.

Step 7: Report

Print summary:

Dataset: synced N cases to MLflow dataset <name> (if sync ran)
Results: logged to experiment <name>, run <run_id> (if log ran)
Feedback: pushed N entries to M traces (if push ran)
Pulled: N annotations from MLflow UI (if pull ran)
MLflow UI: $MLFLOW_TRACKING_URI

Suggest next steps (include --config <config> if a non-default config was used):

/eval-review --run-id <id> for human review
/eval-optimize --model <model> for automated improvement
View results in MLflow UI at the tracking URI

Rules

Read the schema — understand dataset.schema to build the mapping accurately. The mapping is the critical step — everything downstream depends on it.
No hardcoded fields — determine inputs vs expectations by reading the schema descriptions, not by assuming field names.
Graceful degradation — if MLflow is not available, scripts exit 0 and the skill reports "MLflow not available, skipping."
Idempotent — safe to run multiple times. merge_records deduplicates, log_feedback overwrites.
Don't block on traces — trace feedback is optional. If no traces exist, skip and state that tracing is configured automatically by /eval-run.

$ARGUMENTS