name: arena
description: Prepare and run evaluator-gated SAS and MAS experiments with explicit batch control and mandatory elasticity calibration for scenario scaling.
allowed-tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
  - WebSearch
  - WebFetch
arena
Core Rules
- Always run evaluator tests before experiments.
- Never run multi-command experiment sequences without an explicit shared --batch-id.
- Assume scaling metrics are required by default. For any full arena run, collect elasticities eta_n and eta_T.
- Never estimate elasticities from one fixed (n_agents, T) point. Use controlled grids.
- Do not interpret elasticity outputs if pair counts are zero or source is default_zero.
- In dashboard scaling analysis, never leave Batch=all; select the explicit elasticity batch.
- Set --allowed-tools explicitly for reproducible tool policy; runtime enforcement is SDK-based (can_use_tool), not metadata-only.
- For tool elasticity, use core-vs-full tool sets (default --tool-count-grid 6,8): core is Read, Write, Edit, Bash, Glob, Grep; full adds WebSearch and WebFetch.
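The core-vs-full split in the last rule can be sanity-checked before a run. The following is a minimal shell sketch that assumes only the tool names listed above; the variable and function names are illustrative and are not part of the brainqub3 CLI.

```shell
# Tool sets as named in the rules above; variable names are illustrative.
core_tools="Read,Write,Edit,Bash,Glob,Grep"
full_tools="${core_tools},WebSearch,WebFetch"

# Count the comma-separated entries in a list.
count_tools() {
  printf '%s\n' "$1" | tr ',' '\n' | wc -l | tr -d ' '
}

core_count="$(count_tools "$core_tools")"
full_count="$(count_tools "$full_tools")"

# The two counts should match the default grid 6,8.
tool_count_grid="${core_count},${full_count}"
echo "--tool-count-grid ${tool_count_grid}"
```

If the tool lists are ever edited, a mismatch between the printed grid and the value passed on the command line signals that the sweep no longer compares the intended tool sets.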
Default Arena Protocol (mandatory)
Use this unless the user explicitly requests a quick smoke test that skips scaling calibration.
- Pick two explicit batch ids:
  - Comparison batch: <task>_compare_<YYYYMMDD>
  - Elasticity batch: <task>_elasticity_<YYYYMMDD>
- Run SAS and all MAS architectures in the comparison batch.
- Run elasticity sweeps for each MAS architecture in the elasticity batch.
- Launch dashboard and validate non-zero elasticity pair counts.
uv run brainqub3 run sas --task <task> --model <model> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --batch-id <compare_batch_id> --require-live --no-dashboard
uv run brainqub3 run mas --task <task> --arch independent --model <model> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --batch-id <compare_batch_id> --require-live --no-dashboard
uv run brainqub3 run mas --task <task> --arch centralised --model <model> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --batch-id <compare_batch_id> --require-live --no-dashboard
uv run brainqub3 run mas --task <task> --arch decentralised --model <model> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --batch-id <compare_batch_id> --require-live --no-dashboard
uv run brainqub3 run mas --task <task> --arch hybrid --model <model> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --batch-id <compare_batch_id> --require-live --no-dashboard
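The two batch ids used by the protocol can be derived once per session so that every command shares the same values. This is a sketch only: the task name is a placeholder, and the variable names are assumptions rather than anything the CLI requires.

```shell
# Placeholder task name; replace with the real task.
task="my_task"

# Date stamp in YYYYMMDD form, matching the batch id pattern above.
stamp="$(date +%Y%m%d)"

compare_batch_id="${task}_compare_${stamp}"
elasticity_batch_id="${task}_elasticity_${stamp}"

# Abort early if either id is empty, so no command runs without an explicit batch.
: "${compare_batch_id:?comparison batch id is empty}"
: "${elasticity_batch_id:?elasticity batch id is empty}"

echo "comparison batch: ${compare_batch_id}"
echo "elasticity batch: ${elasticity_batch_id}"
```

Setting both ids up front keeps the comparison and elasticity runs in separate batches while guaranteeing that every command within a batch uses an identical id.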
Run elasticity sweep once per MAS architecture:
uv run brainqub3 run elasticity \
--task <task> \
--arch independent \
--model <model> \
--batch-id <elasticity_batch_id> \
--n-agents-grid 3,4 \
--tool-count-grid 6,8 \
--instances <N> \
--require-live \
--no-dashboard
uv run brainqub3 run elasticity --task <task> --arch centralised --model <model> --batch-id <elasticity_batch_id> --n-agents-grid 3,4 --tool-count-grid 6,8 --instances <N> --require-live --no-dashboard
uv run brainqub3 run elasticity --task <task> --arch decentralised --model <model> --batch-id <elasticity_batch_id> --n-agents-grid 3,4 --tool-count-grid 6,8 --instances <N> --require-live --no-dashboard
uv run brainqub3 run elasticity --task <task> --arch hybrid --model <model> --batch-id <elasticity_batch_id> --n-agents-grid 3,4 --tool-count-grid 6,8 --instances <N> --require-live --no-dashboard
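The four sweep commands above differ only in the architecture, so they can be generated from a loop. The sketch below is a dry run that prints each command instead of executing it; the task, model, batch id, and instance count are placeholder values, and only flags that already appear in this document are used.

```shell
# Placeholder values; replace with the real task, model, batch id, and instance count.
task="my_task"
model="my_model"
elasticity_batch_id="my_task_elasticity_20250101"
instances=1

# Dry run: build and print one sweep command per MAS architecture.
sweep_commands=""
for arch in independent centralised decentralised hybrid; do
  cmd="uv run brainqub3 run elasticity --task ${task} --arch ${arch} --model ${model} --batch-id ${elasticity_batch_id} --n-agents-grid 3,4 --tool-count-grid 6,8 --instances ${instances} --require-live --no-dashboard"
  echo "$cmd"
  sweep_commands="${sweep_commands}${cmd}
"
done
```

Printing first makes it easy to confirm that all four commands share the same batch id and grids before anything is run.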
Explicit Opt-Out (smoke test only)
Only skip elasticity if the user explicitly says to skip scaling/scenario calibration.
Smoke-test pattern:
uv run brainqub3 run sas --task <task> --model <model> --batch-id <smoke_batch_id> --require-live --no-dashboard
uv run brainqub3 run mas --task <task> --arch <arch> --model <model> --batch-id <smoke_batch_id> --require-live --no-dashboard
Dashboard Check (required)
- Launch dashboard after runs:
uv run brainqub3 dashboard
- In the Scaling Laws tab, set Batch to the explicit elasticity batch id (not all).
- Confirm each architecture/metric has non-zero controlled pair counts before scenario projections.