name: slm-lab-benchmark
description: Run SLM-Lab deep RL benchmarks, monitor dstack jobs, extract results, and update BENCHMARKS.md. Use when asked to run benchmarks, check run status, extract scores, update benchmark tables, or generate plots.
SLM-Lab Benchmark Skill
Critical Rules
- NEVER push to remote without explicit user permission
- ONLY train runs in BENCHMARKS.md — never search results
- Respect Settings line for each env (max_frame, num_envs, etc.)
- Use the ${max_frame} variable in specs; never hardcode the value
- Runs must complete in <6h (dstack max_duration)
- Max 10 concurrent dstack runs — launch in batches of 10, wait for capacity/completion before launching more. Never submit all runs at once; dstack capacity is limited and mass submissions cause "no offers" failures
Per-Run Intake Checklist
Every completed run MUST go through ALL of these steps. No exceptions. Do not skip any step.
When a run completes (dstack ps shows exited (0)):
- Extract score: dstack logs NAME | grep "trial_metrics" → get total_reward_ma
- Find HF folder name: dstack logs NAME 2>&1 | grep "Uploading data/" → extract folder name from the upload log line
- Update table score in BENCHMARKS.md
- Update table HF link: [FOLDER](https://huggingface.co/datasets/SLM-Lab/benchmark-dev/tree/main/data/FOLDER)
- Pull HF data locally: source .env && huggingface-cli download SLM-Lab/benchmark-dev --local-dir data/benchmark-dev --repo-type dataset --include "data/FOLDER/*"
- Generate plot: List ALL data folders for that env (ls data/benchmark-dev/data/ | grep -i envname), then generate with ONLY the folders matching BENCHMARKS.md entries:
  uv run slm-lab plot -t "EnvName" -d data/benchmark-dev/data -f FOLDER1,FOLDER2,...
  NOTE: -d sets the base data dir, -f takes folder names (NOT full paths). If some folders are in data/ (local runs) and some in data/benchmark-dev/data/, use data/ as base (it has the info/ subfolder needed for metrics).
- Verify plot exists in docs/plots/
- Commit score + link + plot together
A row in BENCHMARKS.md is NOT complete until it has: score, HF link, and plot.
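The HF link step above follows a fixed URL pattern, so it can be generated rather than typed. A minimal sketch, assuming a hypothetical helper name (hf_link) and a made-up folder name; only the URL pattern comes from the checklist:

```shell
# Hypothetical helper (not part of SLM-Lab): build the BENCHMARKS.md link cell.
# $1 = folder name, $2 = HF dataset repo (defaults to the dev repo).
hf_link() {
  folder="$1"
  repo="${2:-SLM-Lab/benchmark-dev}"
  printf '[%s](https://huggingface.co/datasets/%s/tree/main/data/%s)\n' \
    "$folder" "$repo" "$folder"
}

# Example with a made-up folder name
hf_link ppo_pong_example
```

Passing SLM-Lab/benchmark as the second argument yields the public link used after graduation.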
Per-Run Graduation Checklist
After intake, graduate each finalized run to public HF benchmark:
- Upload folder to public HF: source .env && huggingface-cli upload SLM-Lab/benchmark data/benchmark-dev/data/FOLDER data/FOLDER --repo-type dataset
- Update BENCHMARKS.md link: Change SLM-Lab/benchmark-dev → SLM-Lab/benchmark for that entry
- Upload docs/ to public HF (updated plots + BENCHMARKS.md):
  source .env && huggingface-cli upload SLM-Lab/benchmark docs docs --repo-type dataset
  source .env && huggingface-cli upload SLM-Lab/benchmark README.md README.md --repo-type dataset
- Commit link update
- Push to origin
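The link update in the checklist above touches one entry only, so a blanket search-and-replace on benchmark-dev is too broad. One possible sketch, assuming a hypothetical helper name (graduate_link) and that links use the URL pattern from the intake checklist:

```shell
# Hypothetical helper (not part of SLM-Lab): point ONE folder's link at the public repo.
# $1 = folder name, $2 = markdown file (defaults to BENCHMARKS.md).
graduate_link() {
  folder="$1"
  file="${2:-BENCHMARKS.md}"
  tmp="$(mktemp)"
  # The trailing ")" anchors the match, so "foo" does not also rewrite "foo_2".
  sed "s#SLM-Lab/benchmark-dev/tree/main/data/${folder})#SLM-Lab/benchmark/tree/main/data/${folder})#g" \
    "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Usage: graduate_link FOLDER
```

Review the diff before committing; folder names containing regex metacharacters would need escaping.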
Launch
# Launch a run
source .env && uv run slm-lab run-remote --gpu \
-s env=ALE/Pong-v5 SPEC_FILE SPEC_NAME train -n NAME
# Monitor
dstack ps # running jobs
dstack logs NAME | grep "trial_metrics" # extract score at completion
# Score = total_reward_ma from trial_metrics line
# trial_metrics: frame:1.00e+07 | total_reward_ma:816.18 | ...
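Pulling the number out of the trial_metrics line can be scripted. A minimal sketch, assuming the line format shown in the sample above and that the last trial_metrics line is the one to report; extract_score is a hypothetical helper name:

```shell
# Hypothetical helper (not part of SLM-Lab): read log text on stdin and print
# total_reward_ma from the last trial_metrics line.
extract_score() {
  grep 'trial_metrics' | tail -n 1 | sed -n 's/.*total_reward_ma:\([^ |]*\).*/\1/p'
}

# Example using the sample line format above; prints 816.18
echo 'trial_metrics: frame:1.00e+07 | total_reward_ma:816.18 | ...' | extract_score

# Usage against a real run: dstack logs NAME 2>&1 | extract_score
```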
Data Lifecycle
Remote GPU run → auto-uploads to benchmark-dev (HF)
↓ Pull to local data/
↓ Generate plots (docs/plots/)
↓ Update BENCHMARKS.md (scores, links, plots)
↓ Graduate to public benchmark (HF)
↓ Update links: benchmark-dev → benchmark
↓ Upload docs/ to public benchmark (HF)
Pull Data
# Pull full dataset (fast, single request — avoids rate limits)
source .env && hf download SLM-Lab/benchmark-dev \
--local-dir data/benchmark-dev --repo-type dataset
# Or pull specific folder
source .env && hf download SLM-Lab/benchmark-dev \
--local-dir data/benchmark-dev --repo-type dataset --include "data/FOLDER/*"
# KEEP this data — needed for plots AND graduation upload later
Generate Plots
# Find folders for a game (check both local data/ and benchmark-dev)
ls data/ | grep -i pong
ls data/benchmark-dev/data/ | grep -i pong
# Generate comparison plot — use -d for base dir, -f for folder names only
# Use data/ as base (has info/ subfolder with trial_metrics)
uv run slm-lab plot -t "Pong-v5" -f ppo_pong_folder,sac_pong_folder,crossq_pong_folder
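# To plot ONLY folders that appear in BENCHMARKS.md, the -f list can be derived
# from the table links. A sketch, assuming links follow the HF URL pattern used
# in this skill and that folder names contain the game name; linked_folders is
# a hypothetical helper, not an SLM-Lab command.

```shell
# $1 = game pattern (e.g. pong), $2 = markdown file (defaults to BENCHMARKS.md).
# Prints matching linked folder names, comma-joined for the -f flag.
linked_folders() {
  pattern="$1"
  file="${2:-BENCHMARKS.md}"
  grep -o 'huggingface\.co/datasets/SLM-Lab/benchmark[^/]*/tree/main/data/[^)]*' "$file" \
    | sed 's#.*/data/##' \
    | grep -i -- "$pattern" \
    | sort -u \
    | paste -sd, -
}

# Usage: uv run slm-lab plot -t "Pong-v5" -f "$(linked_folders pong)"
```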
Graduate to Public HF
When a run is finalized, graduate individually from benchmark-dev → benchmark:
# Upload individual folder
source .env && huggingface-cli upload SLM-Lab/benchmark \
data/benchmark-dev/data/FOLDER data/FOLDER --repo-type dataset
# Update BENCHMARKS.md link for that entry: benchmark-dev → benchmark
# Then upload docs/ (includes updated plots + BENCHMARKS.md)
source .env && huggingface-cli upload SLM-Lab/benchmark docs docs --repo-type dataset
source .env && huggingface-cli upload SLM-Lab/benchmark README.md README.md --repo-type dataset
| Repo | Purpose |
|---|---|
| SLM-Lab/benchmark-dev | Development (noisy, iterative) |
| SLM-Lab/benchmark | Public (finalized, validated) |
Hyperparameter Search
Only when algorithm fails to reach target:
source .env && uv run slm-lab run-remote --gpu SPEC_FILE SPEC_NAME search -n NAME
Budget: ~3-4 trials per dimension. After search: update spec with best params, run train, use that result.
Autonomous Execution
Work continuously when benchmarking. Use sleep 300 && dstack ps to actively wait (5 min intervals) — never delegate monitoring to background processes or scripts. Stay engaged in the conversation.
Workflow loop (repeat every 5-10 minutes):
- Check status: dstack ps, then identify completed/failed/running
- Intake completed runs: For EACH completed run, do the full intake checklist above (score → HF link → pull → plot → table update)
- Launch next batch: Up to 10 concurrent. Check capacity before launching more
- Iterate on failures: Relaunch or adjust config immediately
- Commit progress: Regular commits of score + link + plot updates
Key principle: Work continuously, check in regularly, iterate immediately on failures. Never idle. Keep reminding yourself to continue without pausing — check on tasks, update, plan, and pick up the next task immediately until all tasks are completed.
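The 10-run cap in the loop above reduces to simple arithmetic on the number of active runs. A minimal sketch, assuming the active count is read off dstack ps by hand and pending run names are kept one per line; next_batch is a hypothetical helper name:

```shell
# Hypothetical helper (not part of SLM-Lab or dstack): print the pending run
# names that still fit under the concurrency cap.
# $1 = runs currently active, $2 = cap (defaults to 10), stdin = pending names.
next_batch() {
  running="$1"
  max="${2:-10}"
  free=$(( max - running ))
  # No free slots: print nothing.
  [ "$free" -gt 0 ] || return 0
  head -n "$free"
}

# Example: 7 runs active, so only the first 3 pending names are printed
printf '%s\n' run_a run_b run_c run_d run_e | next_batch 7
```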
Troubleshooting
- Run interrupted: Relaunch, increment name suffix (e.g., pong3 → pong4)
- Low GPU usage (<50%): CPU bottleneck or config issue
- HF rate limit: Download full dataset, not selective --include patterns
- HF link 404: Run didn't complete or upload failed; rerun
- .env inline comments: Break dstack env vars — put comments on separate lines
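For the .env item above, a quick scan can flag offending lines before a launch. A sketch under the assumption that an inline comment looks like whitespace followed by # after a value; find_inline_comments is a hypothetical helper, and quoted values containing " #" would be false positives:

```shell
# Hypothetical helper: print numbered .env lines that carry an inline comment.
# $1 = env file (defaults to .env). Full-line comments are ignored.
find_inline_comments() {
  grep -nE '^[[:space:]]*[^#[:space:]][^=]*=.*[[:space:]]#' "${1:-.env}"
}

# Usage: find_inline_comments .env
# Fix each hit by moving the comment to its own line above the variable.
```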