model-evaluation

star 0

Use when evaluating model quality, running benchmarks, comparing checkpoints, selecting evaluation tasks, or preparing leaderboard submissions - covers lm-evaluation-harness, lighteval, HELM, benchmark selection, and SkyPilot eval job patterns

slapglif By slapglif schedule Updated 3/25/2026

name: model-evaluation description: Use when evaluating model quality, running benchmarks, comparing checkpoints, selecting evaluation tasks, or preparing leaderboard submissions - covers lm-evaluation-harness, lighteval, HELM, benchmark selection, and SkyPilot eval job patterns

Model Evaluation and Benchmarking

Overview

Evaluate trained models against standardized benchmarks to measure quality, detect regressions, and compare checkpoints. Run evaluations on cloud GPUs via SkyPilot to avoid blocking local resources.

Core principle: Every checkpoint decision (keep/discard/deploy) requires quantitative evaluation against a known baseline. Never ship a model without benchmarking it.

When to Use

  • After training completes or a checkpoint is saved
  • Comparing two model versions (A/B)
  • Preparing a model for deployment or leaderboard submission
  • Validating that fine-tuning did not degrade base capabilities
  • Selecting which benchmarks matter for a given use case

Do not use for:

  • Vibes-based evaluation (use MT-Bench or human eval instead)
  • Latency/throughput testing (use inference benchmarking tools)
  • Training loss curves (use training-monitoring skill)

Primary Tools

lm-evaluation-harness (EleutherAI)

The standard. Backend for the Open LLM Leaderboard. 200+ tasks, YAML config, chat template support.

# Install
pip install lm-eval

# Run evaluation
lm_eval --model hf \
  --model_args pretrained=/path/to/model \
  --tasks mmlu,hellaswag,arc_easy,arc_challenge,winogrande,gsm8k \
  --batch_size auto \
  --output_path /results/

# List available tasks
lm_eval --tasks list

# Use chat template for instruction-tuned models
lm_eval --model hf \
  --model_args pretrained=/path/to/model \
  --tasks mmlu \
  --apply_chat_template \
  --batch_size auto

# VLLM backend for faster inference
lm_eval --model vllm \
  --model_args pretrained=/path/to/model,tensor_parallel_size=2 \
  --tasks mmlu \
  --batch_size auto

Key flags:

  • --batch_size auto -- auto-detect max batch for available VRAM
  • --num_fewshot N -- override default few-shot count
  • --limit 100 -- run only 100 samples per task (fast debugging)
  • --log_samples -- save per-sample predictions for error analysis
  • --apply_chat_template -- required for chat/instruct models

lighteval (HuggingFace)

Lighter weight, tighter HF Hub integration, faster iteration cycle.

pip install lighteval

# Run evaluation
lighteval accelerate \
  --model_args "pretrained=/path/to/model" \
  --tasks "leaderboard|mmlu|5" \
  --output_dir /results/

# Evaluate model directly from HF Hub
lighteval accelerate \
  --model_args "pretrained=meta-llama/Llama-3-8B" \
  --tasks "leaderboard|hellaswag|10"

When to prefer lighteval:

  • Quick iteration during training (lighter overhead)
  • HF Hub models (native integration)
  • Custom task definitions (simpler YAML format)

HELM (Stanford)

Holistic evaluation: accuracy, calibration, robustness, fairness, bias, toxicity, efficiency. Use when evaluation must cover more than accuracy.

pip install crfm-helm
helm-run --run-entries mmlu:model=hf/my-model --suite my-eval
helm-summarize --suite my-eval

When to prefer HELM:

  • Safety-critical deployments
  • Bias and fairness audits
  • Multi-dimensional evaluation (not just accuracy)

Benchmark Selection Guide

Use Case Benchmarks Why
General knowledge MMLU, ARC, HellaSwag, WinoGrande Broad coverage of reasoning and knowledge
Math/reasoning GSM8K, MATH, BBH Chain-of-thought and multi-step
Code generation HumanEval, MBPP, MultiPL-E Functional correctness
Instruction following MT-Bench, AlpacaEval, IFEval Chat quality and compliance
Safety TruthfulQA, ToxiGen, BBQ Hallucination and toxicity
Long context RULER, Needle-in-Haystack Context window utilization

For detailed benchmark descriptions and scoring methodology, see references/benchmark-guide.md.

SkyPilot Eval Job Pattern

Run evaluation on cloud GPUs without blocking local machines.

name: model-eval
resources:
  accelerators: A100:1
  disk_size: 200

file_mounts:
  /model:
    source: s3://my-checkpoints/latest/
  /results:
    name: eval-results
    store: s3
    mode: MOUNT

setup: |
  pip install lm-eval vllm

run: |
  lm_eval --model vllm \
    --model_args pretrained=/model,tensor_parallel_size=1 \
    --tasks mmlu,hellaswag,arc_easy,arc_challenge,winogrande,gsm8k \
    --batch_size auto \
    --output_path /results/${SKYPILOT_TASK_ID}/

For multi-GPU eval, VLLM backend recipes, and custom task configs, see references/eval-recipes.md.

Post-Training Eval Pipeline

train checkpoint
      |
      v
  eval on standard suite
      |
      v
  compare to baseline
      |
      +---> better? KEEP, update baseline
      |
      +---> worse?  DISCARD, investigate
      |
      +---> mixed?  eval on domain-specific tasks, decide

Automation pattern:

# In training script callback or post-training step
BASELINE_MMLU=0.65
RESULT=$(lm_eval --model hf --model_args pretrained=/ckpt \
  --tasks mmlu --batch_size auto --output_path /tmp/eval/ \
  | grep "mmlu" | awk '{print $NF}')

if (( $(echo "$RESULT > $BASELINE_MMLU" | bc -l) )); then
  echo "KEEP: MMLU $RESULT > baseline $BASELINE_MMLU"
  cp -r /ckpt /checkpoints/promoted/
else
  echo "DISCARD: MMLU $RESULT <= baseline $BASELINE_MMLU"
fi

Common Mistakes

Mistake Fix
Evaluating chat model without --apply_chat_template Scores will be much lower than actual capability. Always use chat template for instruct models.
Using --limit in final eval Limit is for debugging only. Full eval required for publication or decisions.
Comparing different few-shot counts Always match --num_fewshot between runs.
Ignoring confidence intervals Small differences (< 1-2%) may be noise. Run multiple seeds or use full eval sets.
Running on CPU Evaluation is slow on CPU. Use GPU, especially with VLLM backend.
Not pinning lm-eval version Different versions can produce different scores. Pin the version in requirements.

Quick Reference

# Standard eval suite (Open LLM Leaderboard v2 tasks)
lm_eval --model hf --model_args pretrained=MODEL \
  --tasks mmlu,arc_challenge,hellaswag,winogrande,truthfulqa_mc2,gsm8k \
  --batch_size auto --output_path /results/

# Fast sanity check (100 samples per task)
lm_eval --model hf --model_args pretrained=MODEL \
  --tasks mmlu --limit 100 --batch_size auto

# Compare two checkpoints
lm_eval --model hf --model_args pretrained=CKPT_A \
  --tasks mmlu --output_path /results/a/
lm_eval --model hf --model_args pretrained=CKPT_B \
  --tasks mmlu --output_path /results/b/
diff <(cat /results/a/results.json) <(cat /results/b/results.json)

# List all tasks matching a pattern
lm_eval --tasks list | grep -i "math"
Install via CLI
npx skills add https://github.com/slapglif/skymcp --skill model-evaluation
Repository Details
star Stars 0
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator