querying-mlflow-metrics

star 51

Fetches aggregated trace metrics (token usage, latency, trace counts, quality evaluations) from MLflow tracking servers. Triggers on requests to show metrics, analyze token usage, view LLM costs, check usage trends, or query trace statistics.

mlflow By mlflow schedule Updated 1/28/2026

name: querying-mlflow-metrics description: Fetches aggregated trace metrics (token usage, latency, trace counts, quality evaluations) from MLflow tracking servers. Triggers on requests to show metrics, analyze token usage, view LLM costs, check usage trends, or query trace statistics.

MLflow Metrics

Run scripts/fetch_metrics.py to query metrics from an MLflow tracking server.

Examples

Token usage summary:

python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m total_tokens -a SUM,AVG

Output: AVG: 223.91 SUM: 7613

Hourly token trend (last 24h):

python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m total_tokens -a SUM \
    -t 3600 --start-time="-24h" --end-time=now

Output: Time-bucketed token sums per hour

Latency percentiles by trace:

python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m latency -a AVG,P95 -d trace_name

Error rate by status:

python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m trace_count -a COUNT -d trace_status

Quality scores by evaluator (assessments):

python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -v ASSESSMENTS \
    -m assessment_value -a AVG,P50 -d assessment_name

Output: Average and median scores for each evaluator (e.g., correctness, relevance)

Assessment count by name:

python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -v ASSESSMENTS \
    -m assessment_count -a COUNT -d assessment_name

JSON output: Add -o json to any command.

Arguments

Arg Required Description
-s, --server Yes MLflow server URL
-x, --experiment-ids Yes Experiment IDs (comma-separated)
-m, --metric Yes trace_count, latency, input_tokens, output_tokens, total_tokens
-a, --aggregations Yes COUNT, SUM, AVG, MIN, MAX, P50, P95, P99
-d, --dimensions No Group by: trace_name, trace_status
-t, --time-interval No Bucket size in seconds (3600=hourly, 86400=daily)
--start-time No -24h, -7d, now, ISO 8601, or epoch ms
--end-time No Same formats as start-time
-o, --output No table (default) or json

For SPANS metrics (span_count, latency), add -v SPANS. For ASSESSMENTS metrics, add -v ASSESSMENTS.

See references/api_reference.md for filter syntax and full API details.

Install via CLI
npx skills add https://github.com/mlflow/skills --skill querying-mlflow-metrics
Repository Details
star Stars 51
call_split Forks 15
navigation Branch main
article Path SKILL.md
More from Creator