arena

name: arena
description: Prepare and run evaluator-gated SAS and MAS experiments with explicit batch control and mandatory elasticity calibration for scenario scaling.
allowed-tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
  - WebSearch
  - WebFetch

Core Rules

Always run evaluator tests before experiments.
Never run multi-command experiment sequences without an explicit shared --batch-id.
Assume scaling metrics are required by default. For any full arena run, collect elasticities eta_n and eta_T.
Never estimate elasticities from one fixed (n_agents, T) point. Use controlled grids.
Do not interpret elasticity outputs if pair counts are zero or source is default_zero.
In dashboard scaling analysis, never leave Batch=all; select the explicit elasticity batch.
Set --allowed-tools explicitly for reproducible tool policy; runtime enforcement is SDK-based (can_use_tool), not metadata-only.
For tool elasticity, use core-vs-full tool sets (default --tool-count-grid 6,8): core is Read,Write,Edit,Bash,Glob,Grep; full adds WebSearch and WebFetch.
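
As a minimal sketch of the core-vs-full rule, the helper below maps a tool count to the matching --allowed-tools string. The names tools_for_count and count_tools are hypothetical and not part of the brainqub3 CLI; how the CLI itself resolves --tool-count-grid values is not shown here, so treat this only as a way to keep the two tool lists consistent in wrapper scripts.

```shell
# Core and full tool sets, exactly as listed in the rule above.
CORE_TOOLS="Read,Write,Edit,Bash,Glob,Grep"
FULL_TOOLS="${CORE_TOOLS},WebSearch,WebFetch"

# tools_for_count (hypothetical helper): map a tool count to a tool list.
# 6 selects the core set, 8 selects the full set; anything else is rejected.
tools_for_count() {
  case "$1" in
    6) printf '%s\n' "$CORE_TOOLS" ;;
    8) printf '%s\n' "$FULL_TOOLS" ;;
    *) printf 'unsupported tool count: %s\n' "$1" >&2; return 1 ;;
  esac
}

# count_tools (hypothetical helper): count comma-separated entries in a tool list.
count_tools() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -c .
}
```

For example, `tools_for_count 8` prints the full list, which can be passed to --allowed-tools unchanged.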

Default Arena Protocol (mandatory)

Use this unless the user explicitly requests a quick smoke test that skips scaling calibration.

Pick two explicit batch ids:
- Comparison batch: <task>_compare_<YYYYMMDD>
- Elasticity batch: <task>_elasticity_<YYYYMMDD>
Run SAS and all MAS architectures in the comparison batch.
Run elasticity sweeps for each MAS architecture in the elasticity batch.
Launch dashboard and validate non-zero elasticity pair counts.
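
The two batch ids from the first step can be built once and reused for every command. The sketch below assumes a POSIX shell; make_batch_id and the task name my_task are placeholders for illustration, and only the <task>_<kind>_<YYYYMMDD> pattern comes from the protocol.

```shell
# make_batch_id (hypothetical helper): build <task>_<kind>_<YYYYMMDD>.
make_batch_id() {
  printf '%s_%s_%s\n' "$1" "$2" "$3"
}

TASK="my_task"               # placeholder task name
STAMP="$(date +%Y%m%d)"      # computed once so both ids share the same date

COMPARE_BATCH_ID="$(make_batch_id "$TASK" compare "$STAMP")"
ELASTICITY_BATCH_ID="$(make_batch_id "$TASK" elasticity "$STAMP")"

echo "$COMPARE_BATCH_ID"
echo "$ELASTICITY_BATCH_ID"
```

Computing the date stamp once avoids a run that crosses midnight splitting into two batches.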

uv run brainqub3 run sas --task <task> --model <model> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --batch-id <compare_batch_id> --require-live --no-dashboard
uv run brainqub3 run mas --task <task> --arch independent   --model <model> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --batch-id <compare_batch_id> --require-live --no-dashboard
uv run brainqub3 run mas --task <task> --arch centralised   --model <model> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --batch-id <compare_batch_id> --require-live --no-dashboard
uv run brainqub3 run mas --task <task> --arch decentralised --model <model> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --batch-id <compare_batch_id> --require-live --no-dashboard
uv run brainqub3 run mas --task <task> --arch hybrid        --model <model> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --batch-id <compare_batch_id> --require-live --no-dashboard
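
The five comparison commands above differ only in subcommand and --arch, so they can be generated from one loop. This is a dry-run sketch: print_compare_commands is a hypothetical helper that only prints the commands, using the same flags as above and nothing else.

```shell
# Full tool set used by every comparison run above.
TOOLS="Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch"

# print_compare_commands (hypothetical helper): print the SAS run plus one MAS
# run per architecture, all sharing the same explicit batch id.
# Arguments: <task> <model> <compare_batch_id>
print_compare_commands() {
  task="$1"; model="$2"; batch="$3"
  tail_flags="--model $model --allowed-tools $TOOLS --batch-id $batch --require-live --no-dashboard"
  echo "uv run brainqub3 run sas --task $task $tail_flags"
  for arch in independent centralised decentralised hybrid; do
    echo "uv run brainqub3 run mas --task $task --arch $arch $tail_flags"
  done
}
```

Printing first makes it easy to confirm that every line carries the same --batch-id before anything is executed.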

Run the elasticity sweep once per MAS architecture, always in the elasticity batch:

uv run brainqub3 run elasticity \
  --task <task> \
  --arch independent \
  --model <model> \
  --batch-id <elasticity_batch_id> \
  --n-agents-grid 3,4 \
  --tool-count-grid 6,8 \
  --instances <N> \
  --require-live \
  --no-dashboard
uv run brainqub3 run elasticity --task <task> --arch centralised   --model <model> --batch-id <elasticity_batch_id> --n-agents-grid 3,4 --tool-count-grid 6,8 --instances <N> --require-live --no-dashboard
uv run brainqub3 run elasticity --task <task> --arch decentralised --model <model> --batch-id <elasticity_batch_id> --n-agents-grid 3,4 --tool-count-grid 6,8 --instances <N> --require-live --no-dashboard
uv run brainqub3 run elasticity --task <task> --arch hybrid        --model <model> --batch-id <elasticity_batch_id> --n-agents-grid 3,4 --tool-count-grid 6,8 --instances <N> --require-live --no-dashboard
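
The four sweeps can likewise be generated from one loop. The dry-run sketch below also refuses a grid with fewer than two distinct values, since a single fixed point cannot produce a controlled pair. Both helpers are hypothetical; whether the CLI performs a similar check itself is not stated here.

```shell
# grid_has_variation (hypothetical helper): succeed only if a comma-separated
# grid contains at least two distinct values.
grid_has_variation() {
  distinct="$(printf '%s\n' "$1" | tr ',' '\n' | grep . | sort -u | wc -l | tr -d ' ')"
  [ "$distinct" -ge 2 ]
}

# print_elasticity_commands (hypothetical helper): print one sweep per MAS
# architecture. Arguments: <task> <model> <elasticity_batch_id> <N>
# [n_agents_grid] [tool_count_grid]; the grids default to 3,4 and 6,8.
print_elasticity_commands() {
  task="$1"; model="$2"; batch="$3"; instances="$4"
  n_grid="${5:-3,4}"; tool_grid="${6:-6,8}"
  grid_has_variation "$n_grid" || { echo "n-agents grid needs at least two distinct values" >&2; return 1; }
  grid_has_variation "$tool_grid" || { echo "tool-count grid needs at least two distinct values" >&2; return 1; }
  for arch in independent centralised decentralised hybrid; do
    echo "uv run brainqub3 run elasticity --task $task --arch $arch --model $model --batch-id $batch --n-agents-grid $n_grid --tool-count-grid $tool_grid --instances $instances --require-live --no-dashboard"
  done
}
```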

Explicit Opt-Out (smoke test only)

Only skip elasticity if the user explicitly says to skip scaling/scenario calibration.

Smoke-test pattern:

uv run brainqub3 run sas --task <task> --model <model> --batch-id <smoke_batch_id> --require-live --no-dashboard
uv run brainqub3 run mas --task <task> --arch <arch> --model <model> --batch-id <smoke_batch_id> --require-live --no-dashboard

Dashboard Check (required)

Launch the dashboard after the runs finish:

uv run brainqub3 dashboard

In the Scaling Laws tab, set Batch to the explicit elasticity batch id (not all).
Confirm that each architecture/metric combination has non-zero controlled pair counts before running scenario projections.