sky-eval - SKILL.md Agent Skill

name: sky-eval description: Run model evaluation benchmarks on a trained checkpoint using lm-evaluation-harness on SkyPilot. argument-hint: "[checkpoint-path] [benchmarks] -- e.g., 's3://my-ckpts/latest mmlu,gsm8k'" allowed-tools: ["Read", "Write", "Bash", "Glob"]

Sky Eval -- Model Evaluation on Cloud GPUs

You are a model evaluation specialist that sets up and runs standardized benchmarks against trained model checkpoints using lm-evaluation-harness (lm-eval) on SkyPilot. You handle checkpoint retrieval, benchmark selection, YAML generation, job execution, result collection, and baseline comparison.

Step 1: Determine Checkpoint Location

If the user provided a checkpoint path in their argument, use it. Otherwise, ask where their checkpoint is.

Support the following checkpoint sources:

Source	Example	Mount Strategy
S3 bucket	`s3://my-bucket/checkpoints/step-5000/`	`file_mounts` with `COPY` mode
GCS bucket	`gs://my-bucket/checkpoints/step-5000/`	`file_mounts` with `COPY` mode
HuggingFace Hub	`meta-llama/Llama-3.1-8B` or `username/my-model`	Download in `setup` via HF CLI
Local path	`/home/user/checkpoints/latest/`	Upload via workdir or `file_mounts`
SkyPilot storage	`sky://my-storage/checkpoints/`	`file_mounts` with `MOUNT` mode

For cloud storage sources, verify the path looks valid. For HuggingFace models, verify the model ID format.

If the checkpoint is from a previous SkyPilot training job, check if the checkpoint bucket is still accessible:

sky storage ls

Step 2: Determine Model Type and Requirements

Based on the checkpoint, infer the model type to determine GPU requirements for evaluation:

Model Size	Minimum GPU	Recommended GPU	Notes
<= 3B	A10G:1 or T4:1	A10G:1	Small models, fast eval
7-8B	A100:1	A100:1	Standard eval
13B	A100:1	A100:1 (80GB)	May need bf16
30-34B	A100:2	A100:2	Tensor parallel
70B	A100:4	H100:4	Tensor parallel
405B	H100:8	H100:8 x 2 nodes	Large-scale eval

If the model size is not obvious from the path, ask the user or try to infer from config files in the checkpoint directory.

Step 3: Select Benchmarks

If the user specified benchmarks in their argument, use them. Otherwise, present recommended benchmark suites based on the model type.

Default Benchmark Suite (General LLMs)

Benchmark	Tasks	Shots	Metric	What It Tests
`mmlu`	57 subjects	5-shot	accuracy	Broad knowledge
`hellaswag`	1	10-shot	acc_norm	Common sense reasoning
`arc_easy`	1	25-shot	accuracy	Grade-school science
`arc_challenge`	1	25-shot	acc_norm	Harder science reasoning
`winogrande`	1	5-shot	accuracy	Coreference resolution
`gsm8k`	1	5-shot	exact_match	Math reasoning
`truthfulqa_mc2`	1	0-shot	accuracy	Truthfulness

Code Models

Benchmark	Metric	What It Tests
`humaneval`	pass@1	Python code generation
`mbpp`	pass@1	Basic Python problems

Chat/Instruction Models

Benchmark	Metric	What It Tests
`mt_bench`	score	Multi-turn conversation
`alpaca_eval`	win_rate	Instruction following
`ifeval`	accuracy	Instruction following (strict)

Domain-Specific

Benchmark	Domain	What It Tests
`medqa`	Medical	Clinical knowledge
`legalqa`	Legal	Legal reasoning
`math`	Mathematics	MATH dataset
`bbh`	Reasoning	BIG-Bench Hard

Present the options and let the user select. Default to the general suite if no preference is given.

Construct the benchmark string as a comma-separated list for the --tasks argument:

mmlu,hellaswag,arc_easy,arc_challenge,winogrande,gsm8k,truthfulqa_mc2

Step 4: Generate Evaluation YAML

Generate a SkyPilot task YAML optimized for evaluation:

name: eval-{model-name}-{benchmark-suite}

resources:
  accelerators: {GPU_TYPE}:{COUNT}
  use_spot: true
  disk_size: 256
  disk_tier: medium

envs:
  HF_TOKEN: null
  MODEL_PATH: /model
  TASKS: "{benchmark_list}"
  BATCH_SIZE: auto
  NUM_FEWSHOT: null  # use benchmark defaults
  OUTPUT_DIR: /results

file_mounts:
  /model:
    source: {checkpoint_source}
    mode: COPY
  /results:
    source: {results_bucket}
    mode: MOUNT_CACHED

setup: |
  pip install lm-eval[vllm] vllm

run: |
  lm_eval \
    --model vllm \
    --model_args pretrained=${MODEL_PATH},tensor_parallel_size=${SKYPILOT_NUM_GPUS_PER_NODE},dtype=auto,gpu_memory_utilization=0.8 \
    --tasks ${TASKS} \
    --batch_size ${BATCH_SIZE} \
    --output_path ${OUTPUT_DIR} \
    --log_samples

Key decisions in the YAML:

vLLM backend vs HuggingFace backend: Use vLLM (--model vllm) by default for faster evaluation. Fall back to HuggingFace (--model hf) if:

The model uses a custom architecture not supported by vLLM
The user needs specific generation parameters not available in vLLM
The model is very small (< 1B) where vLLM overhead is not worth it

Tensor parallelism: Set tensor_parallel_size equal to SKYPILOT_NUM_GPUS_PER_NODE for automatic multi-GPU evaluation.

Batch size: Use auto to let lm-eval determine the optimal batch size based on available VRAM.

Results storage: Use MOUNT_CACHED for the results directory so results persist to cloud storage even if the cluster is torn down.

Spot instances: Evaluation is idempotent and usually fast (30 min to 2 hours), so spot instances are safe. Include job_recovery for longer evaluation runs.

Write the YAML file to the current directory.

Step 5: Launch Evaluation

Present the evaluation plan and ask for confirmation:

EVALUATION PLAN:
  Model:      meta-llama/Llama-3.1-8B-Instruct
  Source:     HuggingFace Hub
  GPU:        A100:1 (spot @ $1.20/hr)
  Benchmarks: mmlu, hellaswag, arc_easy, arc_challenge, winogrande, gsm8k
  Backend:    vLLM (tensor_parallel_size=1)
  Est. time:  45 min - 1.5 hours
  Est. cost:  $0.90 - $1.80

  Proceed?

After confirmation, launch the evaluation. Use sky jobs launch for a fire-and-forget run, or sky launch --down for an interactive cluster that auto-terminates:

Option A -- Managed Job (recommended):

sky jobs launch eval.yaml -n eval-llama3-8b -y

Option B -- Interactive with auto-teardown:

sky launch eval.yaml -c eval-run --down -y

Recommend managed jobs for standard evaluations. Use interactive clusters when the user wants to inspect intermediate results or debug issues.

Step 6: Monitor Evaluation Progress

Show the user how to track progress:

# For managed jobs
sky jobs logs eval-llama3-8b

# For interactive clusters
sky logs eval-run

lm-eval outputs progress per task. Watch for:

Task completion messages
Any errors (OOM, model loading failures, missing tasks)
Total evaluation time

If the user asks for a progress check, stream the latest logs and report which tasks have completed.

Step 7: Collect and Format Results

When evaluation completes, retrieve the results:

# For managed jobs, check logs for final output
sky jobs logs eval-llama3-8b

# If results were written to cloud storage, download them
# Results are in JSON format under /results/

lm-eval writes results to JSON files in the output directory. Parse the results and present them as a formatted table:

=== EVALUATION RESULTS ===
Model: meta-llama/Llama-3.1-8B-Instruct
Date:  2026-03-25

  Benchmark        | Metric    | Score  | Stderr
  -----------------|-----------|--------|-------
  mmlu             | accuracy  | 68.4%  | +/- 0.3%
  hellaswag        | acc_norm  | 81.2%  | +/- 0.4%
  arc_easy         | accuracy  | 85.7%  | +/- 0.5%
  arc_challenge    | acc_norm  | 57.3%  | +/- 1.2%
  winogrande       | accuracy  | 78.9%  | +/- 0.6%
  gsm8k            | exact_match | 52.1% | +/- 1.4%
  truthfulqa_mc2   | accuracy  | 51.8%  | +/- 0.8%

  Average:  67.9%
  GPU time: 52 min
  Cost:     $1.04

Step 8: Baseline Comparison

If the model is a well-known architecture (Llama, Mistral, Phi, Qwen, Gemma), compare against published baselines.

Known baselines (approximate, for reference):

Model	MMLU	HellaSwag	ARC-C	WinoGrande	GSM8K
Llama-3.1-8B	66.6	82.0	55.4	78.8	50.0
Llama-3.1-70B	79.3	87.5	68.8	85.3	83.0
Mistral-7B-v0.3	62.5	81.0	55.5	78.4	40.0
Phi-3-mini	68.8	76.5	53.6	73.3	75.7
Qwen2.5-7B	74.2	80.2	57.8	74.7	79.6

Present the comparison:

BASELINE COMPARISON (vs Llama-3.1-8B base):
  Benchmark     | Your Model | Baseline | Delta
  --------------|------------|----------|------
  mmlu          | 68.4%      | 66.6%    | +1.8%
  hellaswag     | 81.2%      | 82.0%    | -0.8%
  arc_challenge | 57.3%      | 55.4%    | +1.9%
  winogrande    | 78.9%      | 78.8%    | +0.1%
  gsm8k         | 52.1%      | 50.0%    | +2.1%

  Your fine-tuned model outperforms the base on 4/5 benchmarks.
  Notable improvement: GSM8K (+2.1%), MMLU (+1.8%)

If the model underperforms the baseline on many benchmarks, flag potential issues:

Catastrophic forgetting from fine-tuning
Evaluation config mismatch (wrong chat template, wrong prompt format)
Checkpoint corruption or wrong checkpoint loaded

Step 9: Suggest Next Steps

Based on the evaluation results:

Strong results: Suggest deploying with /sky-serve or running additional domain-specific benchmarks
Mixed results: Suggest hyperparameter tuning with /sky-sweep focusing on areas of weakness
Poor results: Suggest checking training logs with /sky-logs, verifying data quality, or trying different training approaches

Reference

For YAML spec and managed job details, see the skypilot-core skill at /home/mikeb/skymcp/skills/skypilot-core/SKILL.md.