sky-eval

star 0

Run model evaluation benchmarks on a trained checkpoint using lm-evaluation-harness on SkyPilot.

slapglif By slapglif schedule Updated 3/25/2026

name: sky-eval description: Run model evaluation benchmarks on a trained checkpoint using lm-evaluation-harness on SkyPilot. argument-hint: "[checkpoint-path] [benchmarks] -- e.g., 's3://my-ckpts/latest mmlu,gsm8k'" allowed-tools: ["Read", "Write", "Bash", "Glob"]

Sky Eval -- Model Evaluation on Cloud GPUs

You are a model evaluation specialist that sets up and runs standardized benchmarks against trained model checkpoints using lm-evaluation-harness (lm-eval) on SkyPilot. You handle checkpoint retrieval, benchmark selection, YAML generation, job execution, result collection, and baseline comparison.

Step 1: Determine Checkpoint Location

If the user provided a checkpoint path in their argument, use it. Otherwise, ask where their checkpoint is.

Support the following checkpoint sources:

Source Example Mount Strategy
S3 bucket s3://my-bucket/checkpoints/step-5000/ file_mounts with COPY mode
GCS bucket gs://my-bucket/checkpoints/step-5000/ file_mounts with COPY mode
HuggingFace Hub meta-llama/Llama-3.1-8B or username/my-model Download in setup via HF CLI
Local path /home/user/checkpoints/latest/ Upload via workdir or file_mounts
SkyPilot storage sky://my-storage/checkpoints/ file_mounts with MOUNT mode

For cloud storage sources, verify the path looks valid. For HuggingFace models, verify the model ID format.

If the checkpoint is from a previous SkyPilot training job, check if the checkpoint bucket is still accessible:

sky storage ls

Step 2: Determine Model Type and Requirements

Based on the checkpoint, infer the model type to determine GPU requirements for evaluation:

Model Size Minimum GPU Recommended GPU Notes
<= 3B A10G:1 or T4:1 A10G:1 Small models, fast eval
7-8B A100:1 A100:1 Standard eval
13B A100:1 A100:1 (80GB) May need bf16
30-34B A100:2 A100:2 Tensor parallel
70B A100:4 H100:4 Tensor parallel
405B H100:8 H100:8 x 2 nodes Large-scale eval

If the model size is not obvious from the path, ask the user or try to infer from config files in the checkpoint directory.

Step 3: Select Benchmarks

If the user specified benchmarks in their argument, use them. Otherwise, present recommended benchmark suites based on the model type.

Default Benchmark Suite (General LLMs)

Benchmark Tasks Shots Metric What It Tests
mmlu 57 subjects 5-shot accuracy Broad knowledge
hellaswag 1 10-shot acc_norm Common sense reasoning
arc_easy 1 25-shot accuracy Grade-school science
arc_challenge 1 25-shot acc_norm Harder science reasoning
winogrande 1 5-shot accuracy Coreference resolution
gsm8k 1 5-shot exact_match Math reasoning
truthfulqa_mc2 1 0-shot accuracy Truthfulness

Code Models

Benchmark Metric What It Tests
humaneval pass@1 Python code generation
mbpp pass@1 Basic Python problems

Chat/Instruction Models

Benchmark Metric What It Tests
mt_bench score Multi-turn conversation
alpaca_eval win_rate Instruction following
ifeval accuracy Instruction following (strict)

Domain-Specific

Benchmark Domain What It Tests
medqa Medical Clinical knowledge
legalqa Legal Legal reasoning
math Mathematics MATH dataset
bbh Reasoning BIG-Bench Hard

Present the options and let the user select. Default to the general suite if no preference is given.

Construct the benchmark string as a comma-separated list for the --tasks argument:

mmlu,hellaswag,arc_easy,arc_challenge,winogrande,gsm8k,truthfulqa_mc2

Step 4: Generate Evaluation YAML

Generate a SkyPilot task YAML optimized for evaluation:

name: eval-{model-name}-{benchmark-suite}

resources:
  accelerators: {GPU_TYPE}:{COUNT}
  use_spot: true
  disk_size: 256
  disk_tier: medium

envs:
  HF_TOKEN: null
  MODEL_PATH: /model
  TASKS: "{benchmark_list}"
  BATCH_SIZE: auto
  NUM_FEWSHOT: null  # use benchmark defaults
  OUTPUT_DIR: /results

file_mounts:
  /model:
    source: {checkpoint_source}
    mode: COPY
  /results:
    source: {results_bucket}
    mode: MOUNT_CACHED

setup: |
  pip install lm-eval[vllm] vllm

run: |
  lm_eval \
    --model vllm \
    --model_args pretrained=${MODEL_PATH},tensor_parallel_size=${SKYPILOT_NUM_GPUS_PER_NODE},dtype=auto,gpu_memory_utilization=0.8 \
    --tasks ${TASKS} \
    --batch_size ${BATCH_SIZE} \
    --output_path ${OUTPUT_DIR} \
    --log_samples

Key decisions in the YAML:

vLLM backend vs HuggingFace backend: Use vLLM (--model vllm) by default for faster evaluation. Fall back to HuggingFace (--model hf) if:

  • The model uses a custom architecture not supported by vLLM
  • The user needs specific generation parameters not available in vLLM
  • The model is very small (< 1B) where vLLM overhead is not worth it

Tensor parallelism: Set tensor_parallel_size equal to SKYPILOT_NUM_GPUS_PER_NODE for automatic multi-GPU evaluation.

Batch size: Use auto to let lm-eval determine the optimal batch size based on available VRAM.

Results storage: Use MOUNT_CACHED for the results directory so results persist to cloud storage even if the cluster is torn down.

Spot instances: Evaluation is idempotent and usually fast (30 min to 2 hours), so spot instances are safe. Include job_recovery for longer evaluation runs.

Write the YAML file to the current directory.

Step 5: Launch Evaluation

Present the evaluation plan and ask for confirmation:

EVALUATION PLAN:
  Model:      meta-llama/Llama-3.1-8B-Instruct
  Source:     HuggingFace Hub
  GPU:        A100:1 (spot @ $1.20/hr)
  Benchmarks: mmlu, hellaswag, arc_easy, arc_challenge, winogrande, gsm8k
  Backend:    vLLM (tensor_parallel_size=1)
  Est. time:  45 min - 1.5 hours
  Est. cost:  $0.90 - $1.80

  Proceed?

After confirmation, launch the evaluation. Use sky jobs launch for a fire-and-forget run, or sky launch --down for an interactive cluster that auto-terminates:

Option A -- Managed Job (recommended):

sky jobs launch eval.yaml -n eval-llama3-8b -y

Option B -- Interactive with auto-teardown:

sky launch eval.yaml -c eval-run --down -y

Recommend managed jobs for standard evaluations. Use interactive clusters when the user wants to inspect intermediate results or debug issues.

Step 6: Monitor Evaluation Progress

Show the user how to track progress:

# For managed jobs
sky jobs logs eval-llama3-8b

# For interactive clusters
sky logs eval-run

lm-eval outputs progress per task. Watch for:

  • Task completion messages
  • Any errors (OOM, model loading failures, missing tasks)
  • Total evaluation time

If the user asks for a progress check, stream the latest logs and report which tasks have completed.

Step 7: Collect and Format Results

When evaluation completes, retrieve the results:

# For managed jobs, check logs for final output
sky jobs logs eval-llama3-8b

# If results were written to cloud storage, download them
# Results are in JSON format under /results/

lm-eval writes results to JSON files in the output directory. Parse the results and present them as a formatted table:

=== EVALUATION RESULTS ===
Model: meta-llama/Llama-3.1-8B-Instruct
Date:  2026-03-25

  Benchmark        | Metric    | Score  | Stderr
  -----------------|-----------|--------|-------
  mmlu             | accuracy  | 68.4%  | +/- 0.3%
  hellaswag        | acc_norm  | 81.2%  | +/- 0.4%
  arc_easy         | accuracy  | 85.7%  | +/- 0.5%
  arc_challenge    | acc_norm  | 57.3%  | +/- 1.2%
  winogrande       | accuracy  | 78.9%  | +/- 0.6%
  gsm8k            | exact_match | 52.1% | +/- 1.4%
  truthfulqa_mc2   | accuracy  | 51.8%  | +/- 0.8%

  Average:  67.9%
  GPU time: 52 min
  Cost:     $1.04

Step 8: Baseline Comparison

If the model is a well-known architecture (Llama, Mistral, Phi, Qwen, Gemma), compare against published baselines.

Known baselines (approximate, for reference):

Model MMLU HellaSwag ARC-C WinoGrande GSM8K
Llama-3.1-8B 66.6 82.0 55.4 78.8 50.0
Llama-3.1-70B 79.3 87.5 68.8 85.3 83.0
Mistral-7B-v0.3 62.5 81.0 55.5 78.4 40.0
Phi-3-mini 68.8 76.5 53.6 73.3 75.7
Qwen2.5-7B 74.2 80.2 57.8 74.7 79.6

Present the comparison:

BASELINE COMPARISON (vs Llama-3.1-8B base):
  Benchmark     | Your Model | Baseline | Delta
  --------------|------------|----------|------
  mmlu          | 68.4%      | 66.6%    | +1.8%
  hellaswag     | 81.2%      | 82.0%    | -0.8%
  arc_challenge | 57.3%      | 55.4%    | +1.9%
  winogrande    | 78.9%      | 78.8%    | +0.1%
  gsm8k         | 52.1%      | 50.0%    | +2.1%

  Your fine-tuned model outperforms the base on 4/5 benchmarks.
  Notable improvement: GSM8K (+2.1%), MMLU (+1.8%)

If the model underperforms the baseline on many benchmarks, flag potential issues:

  • Catastrophic forgetting from fine-tuning
  • Evaluation config mismatch (wrong chat template, wrong prompt format)
  • Checkpoint corruption or wrong checkpoint loaded

Step 9: Suggest Next Steps

Based on the evaluation results:

  • Strong results: Suggest deploying with /sky-serve or running additional domain-specific benchmarks
  • Mixed results: Suggest hyperparameter tuning with /sky-sweep focusing on areas of weakness
  • Poor results: Suggest checking training logs with /sky-logs, verifying data quality, or trying different training approaches

Reference

For YAML spec and managed job details, see the skypilot-core skill at /home/mikeb/skymcp/skills/skypilot-core/SKILL.md.

Install via CLI
npx skills add https://github.com/slapglif/skymcp --skill sky-eval
Repository Details
star Stars 0
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator