slm-lab-benchmark - SKILL.md Agent Skill

name: slm-lab-benchmark
description: Run SLM-Lab deep RL benchmarks, monitor dstack jobs, extract results, and update BENCHMARKS.md. Use when asked to run benchmarks, check run status, extract scores, update benchmark tables, or generate plots.

SLM-Lab Benchmark Skill

Critical Rules

NEVER push to remote without explicit user permission
Record ONLY train runs in BENCHMARKS.md, never search results
Respect the Settings line for each env (max_frame, num_envs, etc.)
Use the ${max_frame} variable in specs; never hardcode the frame count
Runs must complete in <6h (dstack max_duration)
Max 10 concurrent dstack runs: launch in batches of up to 10 and wait for capacity or completion before launching more. Never submit all runs at once; dstack capacity is limited and mass submissions cause "no offers" failures
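
The batching rule above is simple arithmetic. A minimal sketch, assuming you have already counted the active runs by reading `dstack ps` yourself and know how many runs are still waiting; `launchable` and `MAX_CONCURRENT` are hypothetical names, not part of SLM-Lab or dstack:

```shell
# Hypothetical helper: how many runs may be launched right now under the 10-run cap.
# Usage: launchable ACTIVE PENDING
MAX_CONCURRENT=10

launchable() {
  active=$1
  pending=$2
  free=$(( MAX_CONCURRENT - active ))
  # Never report a negative number of slots.
  if [ "$free" -lt 0 ]; then free=0; fi
  # Launch no more than what is actually waiting.
  if [ "$pending" -lt "$free" ]; then free=$pending; fi
  printf '%s\n' "$free"
}

# Example: 7 runs active, 25 still waiting, so 3 may be launched.
launchable 7 25
```

A result of 0 means wait for a run to finish before submitting anything else.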

Per-Run Intake Checklist

Every completed run MUST go through ALL of these steps. No exceptions. Do not skip any step.

When a run completes (dstack ps shows exited (0)):

Extract score: dstack logs NAME | grep "trial_metrics" → get total_reward_ma
Find HF folder name: dstack logs NAME 2>&1 | grep "Uploading data/" → extract folder name from the upload log line
Update table score in BENCHMARKS.md
Update table HF link: [FOLDER](https://huggingface.co/datasets/SLM-Lab/benchmark-dev/tree/main/data/FOLDER)
Pull HF data locally: source .env && huggingface-cli download SLM-Lab/benchmark-dev --local-dir data/benchmark-dev --repo-type dataset --include "data/FOLDER/*"
Generate plot: List ALL data folders for that env (ls data/benchmark-dev/data/ | grep -i envname), then generate with ONLY the folders matching BENCHMARKS.md entries:
```
uv run slm-lab plot -t "EnvName" -d data/benchmark-dev/data -f FOLDER1,FOLDER2,...
```
NOTE: -d sets the base data dir, -f takes folder names (NOT full paths). If some folders are in data/ (local runs) and some in data/benchmark-dev/data/, use data/ as base (it has the info/ subfolder needed for metrics).
Verify plot exists in docs/plots/
Commit score + link + plot together

A row in BENCHMARKS.md is NOT complete until it has: score, HF link, and plot.
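
The first two intake steps can be sketched as small filters over the log text. This is a sketch under stated assumptions: the `trial_metrics` line format follows the example shown in the Launch section, while the exact wording of the upload log line is an assumption (only the `Uploading data/` prefix comes from this document). `extract_score` and `extract_hf_folder` are hypothetical helper names.

```shell
# Print the last total_reward_ma value found in log text on stdin.
extract_score() {
  grep 'trial_metrics' \
    | sed -n 's/.*total_reward_ma:\([^ |]*\).*/\1/p' \
    | tail -n 1
}

# Print the folder name from the last "Uploading data/FOLDER" log line on stdin.
# ASSUMPTION: the folder name follows "Uploading data/" and ends at the next
# slash or space; adjust the pattern if the real log line differs.
extract_hf_folder() {
  grep 'Uploading data/' \
    | sed -n 's|.*Uploading data/\([^/ ]*\).*|\1|p' \
    | tail -n 1
}

# Intended usage (requires dstack and a finished run called NAME):
#   dstack logs NAME 2>&1 | extract_score
#   dstack logs NAME 2>&1 | extract_hf_folder
```

Both helpers print nothing when the pattern is absent, which usually means the run did not finish or the upload failed.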

Per-Run Graduation Checklist

After intake, graduate each finalized run to the public HF benchmark:

Upload folder to public HF:

source .env && huggingface-cli upload SLM-Lab/benchmark data/benchmark-dev/data/FOLDER data/FOLDER --repo-type dataset

Update BENCHMARKS.md link: Change SLM-Lab/benchmark-dev → SLM-Lab/benchmark for that entry

Upload docs/ to public HF (updated plots + BENCHMARKS.md):

source .env && huggingface-cli upload SLM-Lab/benchmark docs docs --repo-type dataset
source .env && huggingface-cli upload SLM-Lab/benchmark README.md README.md --repo-type dataset

Commit link update
Push to origin (only with explicit user permission, per Critical Rules)
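
The link update step can be sketched as a targeted substitution. A minimal sketch, assuming links follow the format given in the intake checklist and that folder names contain only letters, digits, underscores, and hyphens; `graduate_link` is a hypothetical helper, not an SLM-Lab command:

```shell
# Rewrite one entry's link from benchmark-dev to benchmark in a markdown file.
# Usage: graduate_link FOLDER [FILE]   (FILE defaults to BENCHMARKS.md)
graduate_link() {
  folder=$1
  file=${2:-BENCHMARKS.md}
  tmp=$(mktemp) || return 1
  # The trailing ")" anchors the match to the exact folder name, so a folder
  # whose name merely starts with FOLDER is left untouched.
  if sed "s|SLM-Lab/benchmark-dev/tree/main/data/${folder})|SLM-Lab/benchmark/tree/main/data/${folder})|g" \
      "$file" > "$tmp"; then
    # Write back in place so the file keeps its original permissions.
    cat "$tmp" > "$file"
  fi
  rm -f "$tmp"
}

# Intended usage after the folder upload succeeds:
#   graduate_link FOLDER BENCHMARKS.md
```

Review the result with `git diff BENCHMARKS.md` before committing, so only the graduated entry changes.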

Launch

# Launch a run
source .env && uv run slm-lab run-remote --gpu \
  -s env=ALE/Pong-v5 SPEC_FILE SPEC_NAME train -n NAME

# Monitor
dstack ps                              # running jobs
dstack logs NAME | grep "trial_metrics" # extract score at completion

# Score = total_reward_ma from trial_metrics line
# trial_metrics: frame:1.00e+07 | total_reward_ma:816.18 | ...

Data Lifecycle

Remote GPU run → auto-uploads to benchmark-dev (HF)
  ↓ Pull to local data/
  ↓ Generate plots (docs/plots/)
  ↓ Update BENCHMARKS.md (scores, links, plots)
  ↓ Graduate to public benchmark (HF)
  ↓ Update links: benchmark-dev → benchmark
  ↓ Upload docs/ to public benchmark (HF)

Pull Data

# Pull full dataset (fast, single request — avoids rate limits)
source .env && hf download SLM-Lab/benchmark-dev \
  --local-dir data/benchmark-dev --repo-type dataset

# Or pull specific folder
source .env && hf download SLM-Lab/benchmark-dev \
  --local-dir data/benchmark-dev --repo-type dataset --include "data/FOLDER/*"

# KEEP this data — needed for plots AND graduation upload later

Generate Plots

# Find folders for a game (check both local data/ and benchmark-dev)
ls data/ | grep -i pong
ls data/benchmark-dev/data/ | grep -i pong

# Generate comparison plot — use -d for base dir, -f for folder names only
# Use data/ as base (has info/ subfolder with trial_metrics)
uv run slm-lab plot -t "Pong-v5" -d data -f ppo_pong_folder,sac_pong_folder,crossq_pong_folder
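
To plot only the folders that BENCHMARKS.md actually references, the `-f` list can be derived from the HF links in the table. A minimal sketch, assuming links use the format from the intake checklist and that each folder name contains the env name; `benchmark_folders` is a hypothetical helper:

```shell
# Print a comma-separated list of HF folder names linked from a markdown file,
# filtered case-insensitively by an env pattern.
# Usage: benchmark_folders ENV_PATTERN [FILE]   (FILE defaults to BENCHMARKS.md)
benchmark_folders() {
  env_pattern=$1
  file=${2:-BENCHMARKS.md}
  # Match links to either benchmark-dev or benchmark, keep only the folder part.
  grep -o 'huggingface\.co/datasets/SLM-Lab/benchmark[^/]*/tree/main/data/[^)/]*' "$file" \
    | sed 's|.*/data/||' \
    | grep -i -- "$env_pattern" \
    | sort -u \
    | paste -sd, -
}

# Intended usage (base dir chosen as described in the intake checklist):
#   uv run slm-lab plot -t "Pong-v5" -d data/benchmark-dev/data \
#     -f "$(benchmark_folders pong BENCHMARKS.md)"
```

Compare the printed list against the `ls ... | grep -i` output above to spot stale folders that should stay out of the plot.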

Graduate to Public HF

When a run is finalized, graduate it individually from benchmark-dev → benchmark:

# Upload individual folder
source .env && huggingface-cli upload SLM-Lab/benchmark \
  data/benchmark-dev/data/FOLDER data/FOLDER --repo-type dataset

# Update BENCHMARKS.md link for that entry: benchmark-dev → benchmark
# Then upload docs/ (includes updated plots + BENCHMARKS.md)
source .env && huggingface-cli upload SLM-Lab/benchmark docs docs --repo-type dataset
source .env && huggingface-cli upload SLM-Lab/benchmark README.md README.md --repo-type dataset

| Repo | Purpose |
|------|---------|
| `SLM-Lab/benchmark-dev` | Development: noisy, iterative |
| `SLM-Lab/benchmark` | Public: finalized, validated |

Hyperparameter Search

Run a search only when an algorithm fails to reach its target score:

source .env && uv run slm-lab run-remote --gpu SPEC_FILE SPEC_NAME search -n NAME

Budget: ~3-4 trials per search dimension. After the search: update the spec with the best params, rerun in train mode, and record that train result (never the search result) in BENCHMARKS.md.

Autonomous Execution

Work continuously when benchmarking. Use sleep 300 && dstack ps to actively wait (5 min intervals) — never delegate monitoring to background processes or scripts. Stay engaged in the conversation.

Workflow loop (repeat every 5-10 minutes):

Check status: dstack ps — identify completed/failed/running
Intake completed runs: For EACH completed run, do the full intake checklist above (score → HF folder → table update → pull → plot → verify → commit)
Launch next batch: Up to 10 concurrent. Check capacity before launching more
Iterate on failures: Relaunch or adjust config immediately
Commit progress: Regular commits of score + link + plot updates

Key principle: Work continuously, check in regularly, iterate immediately on failures. Never idle. Keep reminding yourself to continue without pausing — check on tasks, update, plan, and pick up the next task immediately until all tasks are completed.

Troubleshooting

Run interrupted: Relaunch, increment name suffix (e.g., pong3 → pong4)
Low GPU usage (<50%): CPU bottleneck or config issue
HF rate limit: Download the full dataset instead of using selective --include patterns
HF link 404: Run didn't complete or upload failed — rerun
.env inline comments: Break dstack env vars — put comments on separate lines
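
The `.env` problem in the last item can be caught before launching. A minimal sketch, assuming plain `KEY=value` lines; `find_inline_env_comments` is a hypothetical helper, and a quoted value that legitimately contains ` #` would be flagged as a false positive:

```shell
# Print .env lines that assign a variable and then carry a trailing "# comment".
# Exit status follows grep: 0 if offending lines were found, 1 if the file is clean.
# Usage: find_inline_env_comments [FILE]   (FILE defaults to .env)
find_inline_env_comments() {
  grep -n '^[A-Za-z_][A-Za-z0-9_]*=.*[[:space:]]#' "${1:-.env}"
}

# Intended usage before launching a run:
#   find_inline_env_comments .env && echo "move these comments to their own lines"
```

Each reported line is prefixed with its line number, so the comment can be moved to a separate line above the assignment.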