arena


Prepare and run evaluator-gated SAS and MAS experiments with explicit batch control and mandatory elasticity calibration for scenario scaling.

By brainqub3, updated 2/11/2026

name: arena
description: Prepare and run evaluator-gated SAS and MAS experiments with explicit batch control and mandatory elasticity calibration for scenario scaling.
allowed-tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
  - WebSearch
  - WebFetch


Core Rules

  1. Always run evaluator tests before experiments.
  2. Never run multi-command experiment sequences without an explicit shared --batch-id.
  3. Assume scaling metrics are required by default. For any full arena run, collect elasticities eta_n and eta_T.
  4. Never estimate elasticities from one fixed (n_agents, T) point. Use controlled grids.
  5. Do not interpret elasticity outputs when pair counts are zero or the source is default_zero.
  6. In dashboard scaling analysis, never leave Batch=all; select the explicit elasticity batch.
  7. Set --allowed-tools explicitly for reproducible tool policy; runtime enforcement is SDK-based (can_use_tool), not metadata-only.
  8. For tool elasticity, use core-vs-full tool sets (default --tool-count-grid 6,8): core is Read,Write,Edit,Bash,Glob,Grep; full adds WebSearch and WebFetch.
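
Rule 8 can be made concrete with a small lookup from a tool-count grid value to the matching tool list. This is a minimal sketch: `tools_for_count` is a hypothetical helper, not part of the brainqub3 CLI, and it assumes the only grid values in use are the defaults 6 and 8.

```shell
#!/bin/sh
# Hypothetical helper (not part of the brainqub3 CLI): map a
# --tool-count-grid value to the tool list described in rule 8.
CORE_TOOLS="Read,Write,Edit,Bash,Glob,Grep"       # core set, 6 tools
FULL_TOOLS="${CORE_TOOLS},WebSearch,WebFetch"     # full set, 8 tools

tools_for_count() {
  case "$1" in
    6) printf '%s\n' "$CORE_TOOLS" ;;
    8) printf '%s\n' "$FULL_TOOLS" ;;
    *) printf 'unsupported tool count: %s\n' "$1" >&2; return 1 ;;
  esac
}

tools_for_count 6
tools_for_count 8
```

The printed strings are in the same comma-separated form that `--allowed-tools` takes in the commands in this document.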

Default Arena Protocol (mandatory)

Use this unless the user explicitly requests a quick smoke test that skips scaling calibration.

  1. Pick two explicit batch ids:
    • Comparison batch: <task>_compare_<YYYYMMDD>
    • Elasticity batch: <task>_elasticity_<YYYYMMDD>
  2. Run SAS and all MAS architectures in the comparison batch.
  3. Run elasticity sweeps for each MAS architecture in the elasticity batch.
  4. Launch the dashboard and validate non-zero elasticity pair counts.

Run the comparison batch (SAS plus every MAS architecture):

uv run brainqub3 run sas --task <task> --model <model> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --batch-id <compare_batch_id> --require-live --no-dashboard
uv run brainqub3 run mas --task <task> --arch independent   --model <model> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --batch-id <compare_batch_id> --require-live --no-dashboard
uv run brainqub3 run mas --task <task> --arch centralised   --model <model> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --batch-id <compare_batch_id> --require-live --no-dashboard
uv run brainqub3 run mas --task <task> --arch decentralised --model <model> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --batch-id <compare_batch_id> --require-live --no-dashboard
uv run brainqub3 run mas --task <task> --arch hybrid        --model <model> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --batch-id <compare_batch_id> --require-live --no-dashboard
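
The two batch ids from step 1 follow a fixed naming pattern, so they can be generated once and reused for every command in a sequence. A minimal sketch, assuming a POSIX shell; `make_batch_id` and the `demo_task` name are hypothetical placeholders, not part of the brainqub3 CLI.

```shell
#!/bin/sh
# Hypothetical helper: build a batch id of the form <task>_<kind>_<YYYYMMDD>.
make_batch_id() {
  # $1 = task, $2 = kind (compare | elasticity), $3 = optional YYYYMMDD stamp
  if [ -z "${1:-}" ] || [ -z "${2:-}" ]; then
    echo "make_batch_id: task and kind are required" >&2
    return 1
  fi
  # Fall back to today's date when no stamp is supplied.
  printf '%s_%s_%s\n' "$1" "$2" "${3:-$(date +%Y%m%d)}"
}

task="demo_task"   # placeholder task name
compare_batch_id="$(make_batch_id "$task" compare)"
elasticity_batch_id="$(make_batch_id "$task" elasticity)"

echo "$compare_batch_id"
echo "$elasticity_batch_id"
```

Passing `$compare_batch_id` or `$elasticity_batch_id` to `--batch-id` on every command keeps a multi-command sequence inside one explicit batch, as rule 2 requires.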

Run elasticity sweep once per MAS architecture:

uv run brainqub3 run elasticity \
  --task <task> \
  --arch independent \
  --model <model> \
  --batch-id <elasticity_batch_id> \
  --n-agents-grid 3,4 \
  --tool-count-grid 6,8 \
  --instances <N> \
  --require-live \
  --no-dashboard
uv run brainqub3 run elasticity --task <task> --arch centralised   --model <model> --batch-id <elasticity_batch_id> --n-agents-grid 3,4 --tool-count-grid 6,8 --instances <N> --require-live --no-dashboard
uv run brainqub3 run elasticity --task <task> --arch decentralised --model <model> --batch-id <elasticity_batch_id> --n-agents-grid 3,4 --tool-count-grid 6,8 --instances <N> --require-live --no-dashboard
uv run brainqub3 run elasticity --task <task> --arch hybrid        --model <model> --batch-id <elasticity_batch_id> --n-agents-grid 3,4 --tool-count-grid 6,8 --instances <N> --require-live --no-dashboard
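
The four sweeps above differ only in `--arch`, so they can be driven from a loop that shares one batch id. The following is a sketch rather than a supported wrapper: `run_sweeps` and the `DRY_RUN` switch are hypothetical, the task, model, and instance values are placeholders, and only flags that already appear in this document are used. With `DRY_RUN=1` (the default here) the commands are printed instead of executed.

```shell
#!/bin/sh
# Sketch: one elasticity sweep per MAS architecture, all in one batch.
ARCHS="independent centralised decentralised hybrid"

run_sweeps() {
  # $1 = task, $2 = model, $3 = elasticity batch id, $4 = value for --instances
  sweep_task="$1"; sweep_model="$2"; sweep_batch="$3"; sweep_n="$4"
  if [ -z "$sweep_batch" ]; then
    echo "run_sweeps: refusing to run without an explicit batch id" >&2
    return 1
  fi
  for arch in $ARCHS; do
    # Rebuild the argument list for this architecture.
    set -- uv run brainqub3 run elasticity \
      --task "$sweep_task" --arch "$arch" --model "$sweep_model" \
      --batch-id "$sweep_batch" --n-agents-grid 3,4 --tool-count-grid 6,8 \
      --instances "$sweep_n" --require-live --no-dashboard
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "$*"            # print the command only
    else
      "$@" || return 1     # run it and stop on the first failure
    fi
  done
}

run_sweeps demo_task demo_model demo_task_elasticity_20260211 5
```

Setting `DRY_RUN=0` runs the sweeps for real; reviewing the printed commands first is a cheap way to confirm that every sweep carries the same `--batch-id`.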

Explicit Opt-Out (smoke test only)

Only skip elasticity if the user explicitly says to skip scaling/scenario calibration.

Smoke-test pattern:

uv run brainqub3 run sas --task <task> --model <model> --batch-id <smoke_batch_id> --require-live --no-dashboard
uv run brainqub3 run mas --task <task> --arch <arch> --model <model> --batch-id <smoke_batch_id> --require-live --no-dashboard

Dashboard Check (required)

  1. Launch dashboard after runs:
uv run brainqub3 dashboard
  2. In the Scaling Laws tab, set Batch to the explicit elasticity batch id (not all).
  3. Confirm each architecture/metric has non-zero controlled pair counts before scenario projections.

Install via CLI
npx skills add https://github.com/brainqub3/agent-labs --skill arena

Repository Details

Stars: 16
Forks: 6
Branch: main
Path: SKILL.md