run-evaluation

Run a VLA model evaluation against a simulation benchmark. Use this skill whenever the user wants to evaluate, benchmark, test, or run a model on a sim environment — even if they say it casually like 'try OpenVLA on LIBERO' or 'get me CALVIN scores'. Covers the full workflow: serving the model, launching the benchmark, sharding for speed, merging results, and interpreting output.

By allenai. Updated 6/1/2026

Run Evaluation

Evaluate a VLA model against a simulation benchmark. The harness decouples model serving (WebSocket server) from benchmark execution (Docker container), so they run as two separate processes.

1. Identify the config pair

Every evaluation needs two YAML configs:

  • Model server config (configs/model_servers/<model>.yaml) — defines script and args for the model server
  • Benchmark config (configs/<benchmark>.yaml) — defines docker.image, benchmarks entries, and output_dir

List available configs:

ls configs/model_servers/    # model servers
ls configs/*.yaml            # benchmarks

Not all model–benchmark pairs are compatible. The model server must produce actions in the format the benchmark expects (e.g. 7-DoF for LIBERO). Many model configs encode their target benchmark in the filename (e.g. oft_libero.yaml, xvla_calvin.yaml).
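Because many model server configs carry their target benchmark in the filename, a small filter can shortlist candidates for a given benchmark. The sketch below assumes the <model>_<benchmark>.yaml naming seen in the examples above; split_config_name and servers_for are hypothetical helpers, not part of vla-eval, and files that do not follow the convention are skipped. A filename match is only a hint, so the action format still needs checking.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def split_config_name(filename: str) -> Optional[Tuple[str, str]]:
    """Split '<model>_<benchmark>.yaml' into (model, benchmark).

    Assumption: the benchmark is the last underscore-separated token of
    the file stem. Returns None when the name has no such token.
    """
    stem = Path(filename).stem
    model, sep, benchmark = stem.rpartition("_")
    if not sep or not model or not benchmark:
        return None
    return model, benchmark


def servers_for(benchmark: str, config_dir: str = "configs/model_servers") -> List[str]:
    """List model server configs whose filename ends in the given benchmark."""
    matches = []
    for path in sorted(Path(config_dir).glob("*.yaml")):
        parsed = split_config_name(path.name)
        if parsed is not None and parsed[1] == benchmark:
            matches.append(path.name)
    return matches
```

For example, servers_for("libero") would return oft_libero.yaml if that file exists in configs/model_servers/.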

2. Check prerequisites

| Requirement | Check command | Notes |
|---|---|---|
| uv | which uv | Runs model server in isolated env |
| Docker | docker info | Benchmarks run inside containers |
| GPU | nvidia-smi | Model inference + sim rendering |
| Disk space | df -h | Model weights (tens of GB) + Docker images (4–10 GB each) |

Model weights download automatically on first vla-eval serve. Docker images are pulled on first vla-eval run (or pre-pull with docker pull <image>).
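The checks in the table can be scripted before a long run. This is a minimal sketch using only the Python standard library; missing_tools and free_disk_gb are hypothetical helper names. It only confirms that the executables are on PATH, so it does not replace docker info (which also verifies the daemon is running) or nvidia-smi (which verifies the driver sees a GPU).

```python
import shutil


def missing_tools(tools=("uv", "docker", "nvidia-smi")):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def free_disk_gb(path="."):
    """Free space on the filesystem holding `path`, in GiB."""
    return shutil.disk_usage(path).free / 1024**3


if __name__ == "__main__":
    missing = missing_tools()
    print("missing tools:", ", ".join(missing) or "none")
    print(f"free disk: {free_disk_gb():.1f} GiB")
```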

Docker image rebuild: Benchmark code runs inside the Docker image. If you (or someone else) changed benchmark source code in src/vla_eval/benchmarks/, the pre-built image is stale — you must rebuild before running:

./docker/build.sh <benchmark_name>   # e.g. ./docker/build.sh libero

Skip the rebuild only if using --dev mode, which bind-mounts local src/ into the container.

3. Run the evaluation (two terminals)

The model server and benchmark runner communicate over WebSocket and must run concurrently.

Terminal 1 — start the model server:

vla-eval serve -c configs/model_servers/<model>.yaml

Wait until curl -fsS http://localhost:8000/health returns HTTP 200 — the server only starts listening after __init__ finishes loading weights, so this is the readiness signal.
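When the benchmark is launched from a script rather than a second terminal, the readiness wait can be automated. The sketch below polls the /health endpoint described above with the standard library; wait_for_health is a hypothetical helper, and the 600-second timeout and 2-second polling interval are assumed defaults, not values taken from the harness.

```python
import http.client
import time
import urllib.request


def wait_for_health(url="http://localhost:8000/health", timeout_s=600.0, interval_s=2.0):
    """Poll `url` until it answers HTTP 200 or `timeout_s` elapses.

    Returns True once the server is ready, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as response:
                if response.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timeout, or HTTP error: not ready yet.
            pass
        time.sleep(interval_s)
    return False
```

A launcher script could call wait_for_health() and start vla-eval run only when it returns True.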

Remote serving via slurm (when model needs GPUs on a different node):

srun --gres=gpu:1 --mem=32G --job-name=model-serve \
  bash -c "uv run vla-eval serve -c configs/model_servers/<model>.yaml" &

Check the allocated node with squeue, verify with curl -s http://<node>:8000/health, then use --server-url ws://<node>:8000 for benchmark runs. Cancel with scancel when done.

Terminal 2 — run the benchmark:

vla-eval run -c configs/<benchmark>.yaml

When the model server is on a remote node, use --server-url to override:

vla-eval run -c configs/<benchmark>.yaml --server-url ws://<slurm-node>:8000

vla-eval run pulls the Docker image if needed, launches the container with --network host, runs all episodes, and saves results to output_dir (default ./results/).

--dev mode: If you changed code in src/ since the Docker image was last built, add --dev to bind-mount local source into the container. Without it, the container runs stale code.

Add -v to either command for debug logging.

4. Parallel sharding

Single-shard runs can take hours. Sharding splits episodes across multiple Docker containers that all connect to the same model server.

# Example: 4-way parallel
for i in 0 1 2 3; do
  vla-eval run -c configs/<benchmark>.yaml --shard-id $i --num-shards 4 &
done
wait

Sharding details:

  • Work items distributed round-robin (deterministic, reproducible)
  • Each shard writes {name}_shard{id}of{total}.json
  • GPU assigned round-robin (shard 0 → GPU 0, shard 1 → GPU 1, …)
  • CPU cores partitioned evenly; OMP_NUM_THREADS=1 per container
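The round-robin rules in the list above can be illustrated with a few lines of Python. This is a sketch of the general technique, not the harness's own code: the function names are hypothetical, the order in which the harness enumerates work items is not specified here, and wrapping around with a modulo when there are more shards than GPUs is an assumption.

```python
def shard_items(items, shard_id, num_shards):
    """Round-robin split: item i goes to shard i % num_shards."""
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")
    return items[shard_id::num_shards]


def gpu_for_shard(shard_id, num_gpus):
    """Round-robin GPU assignment (shard 0 -> GPU 0, shard 1 -> GPU 1, ...).

    Assumption: shards wrap around when num_shards exceeds num_gpus.
    """
    return shard_id % num_gpus


def shard_filename(name, shard_id, num_shards):
    """Result file name following the {name}_shard{id}of{total}.json pattern."""
    return f"{name}_shard{shard_id}of{num_shards}.json"


if __name__ == "__main__":
    # Illustrative work items: 2 tasks x 4 episodes.
    episodes = [f"task{t}/ep{e}" for t in range(2) for e in range(4)]
    for shard in range(4):
        print(shard, shard_items(episodes, shard, 4))
```

Because the split depends only on item order and the shard count, rerunning a single shard with the same --shard-id and --num-shards covers the same items.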

Override resource allocation:

vla-eval run -c config.yaml --gpus "0,1" --cpus "0-31"

See docs/tuning-guide.md for how to derive optimal num_shards, max_batch_size, and max_wait_time.

5. Merge shard results

vla-eval merge -c configs/<benchmark>.yaml -o results/merged.json
# or manually:
vla-eval merge results/*_shard*of4.json -o results/merged.json

Missing shards are allowed — the merged result is marked partial.
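Before merging, it can help to know which shards never produced a result file. The sketch below infers that from filenames alone, relying on the {name}_shard{id}of{total}.json pattern described in the sharding section; missing_shards is a hypothetical helper and does not reflect how vla-eval merge detects partial results internally.

```python
import re

# Matches the trailing "_shard<id>of<total>.json" part of a result filename.
SHARD_RE = re.compile(r"_shard(\d+)of(\d+)\.json$")


def missing_shards(filenames, num_shards):
    """Return the shard ids in range(num_shards) with no result file."""
    present = set()
    for name in filenames:
        match = SHARD_RE.search(name)
        if match and int(match.group(2)) == num_shards:
            present.add(int(match.group(1)))
    return [i for i in range(num_shards) if i not in present]
```

For a 4-way run, missing_shards(glob.glob("results/*_shard*of4.json"), 4) lists the shards worth rerunning before the merge.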

6. Understand results

Results are JSON in output_dir. Structure:

{
  "benchmark": "LIBEROBenchmark_libero_spatial",
  "mean_success": 0.968,
  "tasks": [
    {
      "task": "pick_up_the_black_bowl...",
      "mean_success": 0.96,
      "num_episodes": 50,
      "avg_steps": 95.2,
      "episodes": [
        {"episode_id": 0, "metrics": {"success": true}, "steps": 78, "elapsed_sec": 12.34},
        {"episode_id": 1, "metrics": {"success": false}, "steps": 220, "failure_reason": "timeout", "failure_detail": "..."}
      ]
    }
  ]
}

Key metrics:

  • mean_success — primary metric (fraction of successful episodes, all episodes count)
  • Per-task mean_success — breakdown by task
  • avg_steps — efficiency (lower = better)
  • num_errors — present on tasks that had episodes with failure_reason (connection errors, exceptions, etc.)
  • failure_reason / failure_detail — per-episode diagnostic fields for debugging failures
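A result file can be inspected programmatically with a short script. The sketch below assumes the JSON structure shown above (tasks, episodes, metrics.success, failure_reason); summarize is a hypothetical helper that recomputes each task's success rate from its episodes and tallies failure reasons, which is a quick way to cross-check the reported mean_success values.

```python
from collections import Counter


def summarize(result):
    """Recompute per-task success rates and tally failure reasons.

    Returns (per_task, reasons): a dict of task name -> success fraction,
    and a dict of failure_reason -> episode count.
    """
    per_task = {}
    reasons = Counter()
    for task in result["tasks"]:
        episodes = task["episodes"]
        successes = sum(
            1 for ep in episodes if ep.get("metrics", {}).get("success")
        )
        per_task[task["task"]] = successes / len(episodes) if episodes else 0.0
        for ep in episodes:
            if "failure_reason" in ep:
                reasons[ep["failure_reason"]] += 1
    return per_task, dict(reasons)
```

Load the file with json.load and pass the resulting dict to summarize; a large count for one failure_reason usually points at the first thing to debug.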

7. Advanced options

| Option | Command | Purpose |
|---|---|---|
| No Docker | --no-docker | Dev/debug, requires local benchmark deps |
| Dev mode | --dev | Bind-mounts local src/ into container (no rebuild needed) |
| Real-time mode | Set mode: realtime in config | For control benchmarks (Kinetix) |
| Skip Docker prompt | --yes | Non-interactive image pull |
| Custom overrides | Edit config YAML | episodes_per_task, max_steps, max_tasks, params.seed, server.timeout |

Custom output directory

Override output_dir with --output-dir:

vla-eval run -c configs/<benchmark>.yaml --output-dir results/my-experiment/

Default is ./results/ (from config YAML). The CLI flag takes precedence over the config value.

Parallel evaluations of different models

Shard result files are named by benchmark + shard count. If two evals share the same benchmark config, shard count, and output directory, a file lock prevents silent overwrites. Use different output_dir values or different shard counts to avoid collisions.
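One way to keep concurrent evaluations apart is to give each model its own output directory when generating the shard commands. The sketch below only builds the argument lists, using the --shard-id, --num-shards, and --output-dir flags described in this document; shard_commands and the model labels are hypothetical, and combining these flags in one invocation is an assumption that should be checked against vla-eval run --help.

```python
import shlex


def shard_commands(config, num_shards, output_dir):
    """Build one `vla-eval run` argv per shard, all writing to `output_dir`."""
    return [
        [
            "vla-eval", "run", "-c", config,
            "--shard-id", str(shard_id),
            "--num-shards", str(num_shards),
            "--output-dir", output_dir,
        ]
        for shard_id in range(num_shards)
    ]


if __name__ == "__main__":
    # Hypothetical model labels; each gets its own results directory.
    for model in ("model-a", "model-b"):
        for argv in shard_commands("configs/<benchmark>.yaml", 2, f"results/{model}"):
            print(shlex.join(argv))
```

Each argv list can be passed to subprocess.Popen to launch the shard in the background.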

Monitoring shard progress

Each shard writes a .progress file that updates after every episode. Use watch for a live dashboard:

watch -n 2 'for f in results/*.progress; do [ -e "$f" ] || continue; echo "$(basename "$f" .progress): $(cat "$f")"; done; echo "---"; echo "Done: $(ls results/*shard*of*.json 2>/dev/null | wc -l) shards"'

Progress files are removed automatically when the shard finishes and writes its result JSON. Lock files are also cleaned up on completion.

Troubleshooting

| Problem | Fix |
|---|---|
| Docker daemon not running | Start Docker (may need sysadmin on shared clusters) |
| Connection refused | Server not ready; wait for GET /health to return HTTP 200 |
| TimeoutError | Increase server.timeout in config or check GPU utilization |
| OOM | Reduce batch size or use smaller checkpoint |
| Mismatched action dims | Check unnorm_key and chunk_size in model server config |
| Partial results | Server disconnected; results up to that point are saved automatically |
