openenv-benchmark - SKILL.md Agent Skill

name: openenv-benchmark
description: Run OpenEnv scaling and concurrency benchmark experiments. Use when deploying benchmark infrastructure (local uvicorn, local docker, HF Spaces, SLURM single-node, SLURM multi-node), running test_scaling.py tests, or analyzing experiment results. Triggers on requests to benchmark, test scaling, measure concurrency, compare HTTP vs WebSocket performance, or review experiment reports.

OpenEnv Benchmark Experiments

Run scaling experiments to measure maximum concurrent batch sizes across infrastructure options.

Workflow Overview

1. Deploy infrastructure (choose one: local-uvicorn, local-docker, hf-spaces, slurm-single, slurm-multi)
2. Run scaling tests with tests/test_scaling.py
3. Analyze results with experiments/scripts/analyze_results.py

Step 1: Deploy Infrastructure

Prerequisites

pip install -e .  # or pip install -e ".[analysis]" for matplotlib
python -c "from benchmark.server.app import app; print('OK')"

Infrastructure Options

Infrastructure	Deploy Command	URL	Max Batch
local-uvicorn	`./deploy/local/run_uvicorn.sh`	http://localhost:8000	64-128
local-docker	`./deploy/local/run_docker.sh`	http://localhost:8000	64-128
hf-spaces	`./deploy/hf_spaces/deploy.sh --repo-id USER/openenv-benchmark`	https://USER-openenv-benchmark.hf.space	10-32
slurm-single	`sbatch deploy/slurm/serve_single.sh`	http://${SLURM_NODE_IP}:8000	128-256
slurm-multi	`./deploy/slurm/alloc.sh` then `./deploy/slurm/serve_multi.sh`	http://${ENVOY_IP}:8000	256-512

Deploy Commands

Local Uvicorn (configurable workers):

WORKERS=8 PORT=8000 MAX_CONCURRENT_ENVS=200 ./deploy/local/run_uvicorn.sh

Local Docker:

./deploy/local/run_docker.sh
# Or manually: docker run -d --name openenv-benchmark -p 8000:8000 -e WORKERS=4 openenv-benchmark:latest

HF Spaces:

export HF_USER="your-username"
./deploy/hf_spaces/deploy.sh --repo-id ${HF_USER}/openenv-benchmark
# Wake up before testing:
curl https://${HF_USER}-openenv-benchmark.hf.space/health

SLURM Single Node:

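# --parsable prints only the job ID, so the job just submitted is the one tracked
export JOB_ID=$(sbatch --parsable deploy/slurm/serve_single.sh)
# %N (the allocated node's hostname) is empty while the job is pending, so poll until it is set
until SLURM_NODE_IP=$(squeue -j "$JOB_ID" -h -o "%N") && [ -n "$SLURM_NODE_IP" ]; do sleep 5; done
export SLURM_NODE_IP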
# Wait for server:
while ! curl -s http://${SLURM_NODE_IP}:8000/health > /dev/null 2>&1; do sleep 5; done

SLURM Multi-Node (with Envoy load balancer):

WORKERS=4 CPUS_PER_WORKER=4 ./deploy/slurm/alloc.sh  # Opens interactive shell
./deploy/slurm/serve_multi.sh
source openenv-connection.env
echo "URL: $OPENENV_URL"

Verify Deployment

curl http://localhost:8000/health
python tests/test_scaling.py --url http://localhost:8000 -n 5 -w 0.5
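
A `curl`-based readiness loop waits forever if the server never comes up. The sketch below is one way to poll the `/health` endpoint with an upper bound, using only the standard library. The function names, the `max_wait` and `interval` parameters, and the "HTTP 200 means healthy" rule are assumptions for illustration, not part of the benchmark code.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on url answers with HTTP 200 (assumed health rule)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx status.
        return False


def wait_for_health(
    base_url: str,
    max_wait: float = 300.0,
    interval: float = 5.0,
    probe: Callable[[str], bool] = http_probe,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll <base_url>/health until it succeeds or max_wait seconds elapse."""
    health_url = base_url.rstrip("/") + "/health"
    deadline = time.monotonic() + max_wait
    while True:
        if probe(health_url):
            return True
        # Stop if another sleep would take us past the deadline.
        if time.monotonic() + interval > deadline:
            return False
        sleep(interval)
```

Calling `wait_for_health("http://localhost:8000")` before a sweep returns `False` after roughly five minutes instead of hanging. The `probe` and `sleep` arguments exist only so the loop can be exercised without a live server.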

Step 2: Run Scaling Tests

test_scaling.py CLI Reference

Option	Default	Description
`--url, -u`	http://localhost:8000	Server URL
`--requests, -n`	10	Concurrent requests (batch size)
`--wait, -w`	1.0	Wait time per request (seconds)
`--mode, -m`	ws	Test mode: `http` or `ws`
`--requests-grid`	-	Comma-separated batch sizes for grid sweep
`--wait-grid`	-	Comma-separated wait times for grid sweep
`--reps`	1	Repetitions per configuration
`--compare`	false	Run both HTTP and WebSocket
`--output-dir, -o`	-	Output directory for JSONL/CSV
`--timeout, -t`	120.0	Timeout per request (seconds)

Standard Experiment

Full grid sweep comparing HTTP vs WebSocket:

python tests/test_scaling.py \
    --url http://localhost:8000 \
    --requests-grid 1,2,4,8,16,32,64,128 \
    --wait-grid 0.1,1.0,5.0 \
    --reps 3 \
    --compare \
    --output-dir experiments/results/local-uvicorn/$(date +%Y-%m-%d)
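
To get a feel for the size of this sweep before launching it, the grid can be expanded by hand. The sketch below assumes that `test_scaling.py` runs the full Cartesian product of batch sizes, wait times, and repetitions, and that `--compare` runs that product once per protocol. The helper `expand_grid` and the ordering of runs are illustrative only.

```python
from itertools import product


def expand_grid(requests_grid, wait_grid, reps, modes):
    """List every (mode, batch size, wait, repetition) combination in a sweep."""
    return [
        {"mode": mode, "requests": n, "wait": wait, "rep": rep}
        for mode, n, wait, rep in product(
            modes, requests_grid, wait_grid, range(1, reps + 1)
        )
    ]


# Values taken from the standard experiment command.
runs = expand_grid(
    requests_grid=[1, 2, 4, 8, 16, 32, 64, 128],
    wait_grid=[0.1, 1.0, 5.0],
    reps=3,
    modes=["http", "ws"],  # assumed effect of --compare
)

total_sessions = sum(run["requests"] for run in runs)
print(len(runs))        # 144 configurations (8 * 3 * 3 * 2)
print(total_sessions)   # 4590 sessions in total (255 * 18)
```

Under these assumptions the standard experiment issues 144 runs, which is worth knowing before pointing it at a rate-limited or shared deployment.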

Quick Validation Test

python tests/test_scaling.py \
    --url http://localhost:8000 \
    --requests-grid 1,4,16,64 \
    --wait-grid 1.0 \
    --reps 1 \
    --mode ws \
    --output-dir experiments/results/local-uvicorn/quick-test

Infrastructure-Specific Recommendations

HF Spaces Free Tier: Use --requests-grid 1,2,4,8,16 --timeout 180
SLURM Single: Use --requests-grid 1,2,4,8,16,32,64,128,256
SLURM Multi: Use --requests-grid 1,2,4,8,16,32,64,128,256,512
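
The recommended grids are all powers of two up to the ceiling expected for each infrastructure. A small helper like the one below (hypothetical, not part of the repository) can build the `--requests-grid` value for any ceiling.

```python
def power_of_two_grid(max_batch: int) -> str:
    """Return '1,2,4,...' with every power of two up to and including max_batch."""
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    sizes = []
    n = 1
    while n <= max_batch:
        sizes.append(n)
        n *= 2
    return ",".join(str(size) for size in sizes)


print(power_of_two_grid(16))   # 1,2,4,8,16
print(power_of_two_grid(512))  # 1,2,4,8,16,32,64,128,256,512
```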

Step 3: Analyze Results

Output Files

Tests generate:

raw.jsonl - Per-session detailed results (request_id, latencies, pid, session_hash, host_url, errors)
summary.csv - Aggregated statistics (success rates, p50/p90/p95/p99 latencies, throughput, effective_concurrency)
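
For a quick look at `raw.jsonl` without running the full analysis script, the records can be summarized directly. This is a minimal sketch: the key names `total_latency` and `error` are assumptions (check a line of your own `raw.jsonl` and pass the real names), and the nearest-rank percentile used here may differ slightly from whatever `summary.csv` reports.

```python
import json
import math


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, for 0 < q <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(lines, latency_key="total_latency", error_key="error"):
    """Summarize JSONL session records; key names are assumed, not confirmed."""
    total = 0
    succeeded = 0
    latencies = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        total += 1
        if record.get(error_key):
            continue  # a non-empty error field marks a failed session
        succeeded += 1
        if record.get(latency_key) is not None:
            latencies.append(float(record[latency_key]))
    return {
        "sessions": total,
        "success_rate": succeeded / total if total else None,
        "p50": percentile(latencies, 50) if latencies else None,
        "p99": percentile(latencies, 99) if latencies else None,
    }
```

A file object can be passed as `lines`, for example `with open(path) as f: print(summarize(f))`.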

analyze_results.py CLI Reference

# Analyze single experiment
python experiments/scripts/analyze_results.py \
    --input experiments/results/local-uvicorn/2026-01-09

# Analyze all infrastructures
python experiments/scripts/analyze_results.py --all

# Custom success threshold (default 95%)
python experiments/scripts/analyze_results.py \
    --input experiments/results/local-uvicorn/2026-01-09 \
    --success-threshold 0.90

Option	Description
`--input, -i`	Input directory with raw.jsonl and summary.csv
`--all`	Analyze all infrastructures in experiments/results/
`--output, -o`	Output directory for figures (default: experiments/reports/figures/)
`--success-threshold`	Success rate threshold for max batch (default: 0.95)
`--tables-only`	Generate tables only, skip figures
`--figures-only`	Generate figures only, skip tables

Generated Reports

experiments/reports/tables.md - Markdown tables (max batch, protocol comparison, latency breakdown)
experiments/reports/figures/ - PNG plots (max_batch_comparison.png, scaling_curves.png, latency_heatmap.png)
experiments/reports/EXPERIMENT_LOG.md - Run history

Key Metrics to Review

Max Batch Size: Largest concurrent batch with a success rate of at least 95% (adjustable with --success-threshold)
Protocol Comparison: WS typically achieves 10-20x higher throughput than HTTP
Latency Breakdown: connect_p50, reset_p50, step_p50, total_p99
Distribution Metrics: unique_pids, unique_sessions, unique_hosts (verify load balancing)
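
The max batch size metric can be sketched as follows. This is not the logic of `analyze_results.py`, only one reasonable reading of it: the column names `requests` and `success_rate` are assumptions, success rates are taken to be fractions between 0 and 1, and when a batch size has several rows (for example one per repetition) the worst one must still meet the threshold.

```python
def max_batch_size(rows, threshold=0.95, batch_key="requests", rate_key="success_rate"):
    """Largest batch size whose worst observed success rate is >= threshold."""
    worst = {}
    for row in rows:
        batch = int(row[batch_key])
        rate = float(row[rate_key])
        # Keep the lowest success rate seen for each batch size.
        worst[batch] = min(rate, worst.get(batch, rate))
    passing = [batch for batch, rate in worst.items() if rate >= threshold]
    return max(passing) if passing else None


rows = [
    {"requests": "16", "success_rate": "1.0"},
    {"requests": "32", "success_rate": "0.97"},
    {"requests": "32", "success_rate": "0.90"},  # one bad repetition
    {"requests": "64", "success_rate": "0.50"},
]
print(max_batch_size(rows))                  # 16
print(max_batch_size(rows, threshold=0.90))  # 32
```

Rows in this shape are what `csv.DictReader` yields, so the function can be fed from `summary.csv` once the real column names are substituted. If the file mixes protocols or wait times, filter to one combination first so the result is not blended across them.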

Verify Load Balancing (Multi-Node)

python -c "
import json
hosts = set()
with open('experiments/results/slurm-multi/$(date +%Y-%m-%d)/raw.jsonl') as f:
    for line in f:
        data = json.loads(line)
        if data.get('host_url'):
            hosts.add(data['host_url'])
print(f'Unique hosts: {len(hosts)}')
print(hosts)
"

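Counting unique hosts confirms that every backend received traffic, but not that traffic was spread evenly. The sketch below extends the same idea with per-host session counts and a simple max/min ratio. It relies on the `host_url` field used in the check above; the function names and the ratio as an evenness measure are illustrative choices.

```python
import json
from collections import Counter


def host_distribution(lines, host_key="host_url"):
    """Count sessions per host from JSONL lines, skipping records without a host."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        host = json.loads(line).get(host_key)
        if host:
            counts[host] += 1
    return counts


def imbalance_ratio(counts):
    """Busiest host's sessions divided by the quietest host's; 1.0 is perfectly even."""
    if not counts:
        return None
    return max(counts.values()) / min(counts.values())
```

For example, `with open(path) as f: counts = host_distribution(f)` followed by `print(counts.most_common(), imbalance_ratio(counts))`. A ratio well above 1.0, or fewer hosts than worker nodes, suggests checking the Envoy configuration.
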
Cleanup

# Local uvicorn
pkill -f "uvicorn benchmark.server.app"

# Local docker
docker stop openenv-benchmark && docker rm openenv-benchmark

# SLURM
scancel $JOB_ID  # or exit the allocation shell

Troubleshooting

Issue	Solution
Port in use	`lsof -i :8000` then `kill -9 <PID>`
Connection refused	Verify the server is running: `curl http://localhost:8000/health`
High error rate	Reduce MAX_CONCURRENT_ENVS or increase WORKERS
HF Space sleeping	Send a health check request to wake it: `curl https://${HF_USER}-openenv-benchmark.hf.space/health`
SLURM job won't start	Check `sinfo -p hopper-cpu` for partition availability
Uneven load distribution	Verify all worker nodes started, check Envoy config