Explore AI Agent Skills & Claude Prompts

star 3.8k

Update the body of a GitHub pull request. Use when the user asks to update, edit, or modify a PR description/body.

Updated 4 months ago

monitor-experiment

star 3.8k

Monitor Beaker experiments until completion. Use when the user asks to monitor, check, or track a Beaker experiment.

Updated 5 months ago

training-smoke-test

star 1.3k

This skill should be used when the user asks to "create a verification script", "write a test training run", "make a quick training job", "verify my change with a short run", "launch a smoke test", or wants a small Beaker job to validate that a feature works end-to-end on GPUs. Also triggers when the user mentions creating a modified 190M script to test a specific behavior.

Updated 3 months ago

run-evaluation

star 373

Run a VLA model evaluation against a simulation benchmark. Use this skill whenever the user wants to evaluate, benchmark, test, or run a model on a sim environment — even if they say it casually like 'try OpenVLA on LIBERO' or 'get me CALVIN scores'. Covers the full workflow: serving the model, launching the benchmark, sharding for speed, merging results, and interpreting output.

Updated 23 days ago

add-benchmark

star 373

Add a new simulation benchmark to the VLA evaluation harness. Use this skill whenever the user wants to integrate, create, or add a new benchmark or simulation environment — e.g. 'add ManiSkill3', 'integrate OmniGibson', 'hook up a new sim'. Also use when they ask how benchmarks are structured or want to understand the benchmark interface.

add-model-server

star 373

Add a new VLA model server to the evaluation harness. Use this skill whenever the user wants to integrate, create, or add a new model — e.g. 'add OpenVLA server', 'integrate RT-2', 'hook up my model', 'write a model server'. Also use when they ask how model servers work or want to understand the server interface.

Updated 23 days ago

hugging-face-evaluation

star 36

Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.

hugging-face-datasets

star 36

Create and manage datasets on Hugging Face Hub. Supports initializing repos, defining configs/system prompts, streaming row updates, and SQL-based dataset querying/transformation. Designed to work alongside the HF MCP server for comprehensive dataset workflows.

eval-templates

star 36

Runnable evaluation template scripts for ML tasks. Match task_type to template, adapt CONFIG, run.

workspace

star 16

Shows the user the agent's work on a research project and saves iterations on the user's behalf. Scaffolds rendering and deployment infrastructure (Quarto today, GitHub Pages, dev container) and displays the rendered output. Doesn't handle research execution (use `research-step`).

Updated 21 days ago

generate-theories

star 16

This skill should be used when the user asks to "generate theories", "theorize about", "what theories explain", "form scientific theories", "literature-driven theories", "hypothesize", "form hypotheses", "generate hypotheses", "what hypotheses explain", "run the theorizer", or wants AI-generated, literature-grounded scientific theories or hypotheses about a research question.

pdf-extraction

star 16

Extract text from PDFs using olmOCR or remote OCR. Use when the user asks to "extract text from PDF", "OCR a document", "read a PDF", or needs to process scanned documents.