name: bubench-run description: Use this skill whenever the user describes a benchmark experiment using any combination of these parameters — agent (e.g. browser-use, skyvern), model (e.g. deepseek, minimax, claude, gemini, gpt), data/benchmark (e.g. LexBench-Browser), browser (e.g. Chrome-Local, lexmount), and tasks (specific IDs or "all"). Trigger phrases include "跑实验 agent=xxx model=xxx", "跑实验 agent xxx model xxx", "run [model] on [tasks]", "eval [model] results", "re-run [model]", or any request that specifies an agent + model + task set to execute or evaluate. Also triggers on requests to chain run→eval pipelines, check experiment progress, or read final results.
BrowserUse-Bench Run & Eval Workflow
This skill guides running benchmark experiments on browseruse-bench: launching agent runs, evaluating results, and chaining multiple models in sequence — all using background processes and monitors so Claude stays responsive while jobs run.
Key Commands
# Run a model on specific tasks (background)
uv run scripts/run.py --agent browser-use --data LexBench-Browser \
--mode specific --model-name <model-name> \
--task-ids <ids...> \
> output/logs/run/<label>.log 2>&1 &
# Evaluate results (background)
uv run scripts/eval.py --agent browser-use --data LexBench-Browser \
--model-id <model-id> \
> output/logs/eval/<label>.log 2>&1 &
# Evaluate a specific timestamp directory (for incremental/second passes)
uv run scripts/eval.py --agent browser-use --data LexBench-Browser \
--model-id <model-id> --timestamp <YYYYMMDD_HHMMSS> \
> output/logs/eval/<label>.log 2>&1 &
Model Name → Model ID Mapping (from config.yaml)
--model-name |
--model-id (in experiments path) |
|---|---|
deepseek |
deepseek-v4-pro |
minimax |
MiniMax-M2.7 |
claude |
dmx-claude-opus-4-7 |
gemini |
gemini-2.5-pro |
gpt |
gpt-5.5 |
Always check config.yaml to confirm the current model_id before running eval — it must match the directory name under experiments/.
Output Paths
experiments/LexBench-Browser/All/browser-use/<model-id>/<timestamp>/
tasks/ # one subdir per task
tasks_eval_result/
eval.log # live eval progress
task_gpt-5.4_per_task_threshold_stepwise_summary.json
Run logs: output/logs/run/<timestamp>.log
Eval logs: output/logs/eval/<label>.log
Monitoring Pattern
Always use Monitor (not polling loops) to track background jobs.
Run progress (persistent — tasks take hours):
tail -f output/logs/run/<label>.log | grep --line-buffered \
-E "\[[0-9]+/N\]\[[0-9]+\] (completed|failed)|Run complete|ERROR"
Eval progress (non-persistent — usually finishes in <1 hour):
tail -f experiments/.../tasks_eval_result/eval.log | grep --line-buffered \
-E "PASS|FAIL|SUCCESS:|ERROR"
Wait for process to exit (to auto-chain next step):
until ! pgrep -f "eval.py.*<model-id>" > /dev/null 2>&1; do sleep 10; done \
&& echo "EVAL DONE"
Wait for a task directory to appear (before second eval pass):
until [ -d "experiments/.../tasks/<task-id>" ]; do sleep 10; done \
&& echo "TASK READY"
Sequential Pipeline for Multiple Models
When chaining run→eval across several models, do them one at a time:
- Start model A run (background + persistent monitor)
- When run completes → start model A eval (background + monitor)
- When eval completes → start model B run
- Repeat
Don't start the next model's run until the current eval finishes — this avoids Chrome browser contention and keeps logs clean.
Incremental Eval (the key trick)
By default, eval.py skips tasks that already have results. This enables:
- First pass: eval starts as soon as most tasks are done, skipping tasks not yet written
- Second pass: re-run eval after the run fully completes — it only evaluates the missing tasks (the ones that were still running during the first pass)
- No
--force-reevalneeded unless you actually want to re-score everything
Use --timestamp <YYYYMMDD_HHMMSS> to target a specific run directory when there are multiple timestamps for the same model.
Reading Final Results
python3 -c "
import json
with open('experiments/LexBench-Browser/All/browser-use/<model-id>/<timestamp>/tasks_eval_result/task_gpt-5.4_per_task_threshold_stepwise_summary.json') as f:
d = json.load(f)
s = d['overall_statistics']
print(f'Tasks: {s[\"total_tasks\"]}, Success: {s[\"successful_tasks\"]}, Rate: {s[\"success_rate\"]}%')
print(f'Successful IDs: {d[\"task_list\"][\"successful_task_ids\"]}')
"
Or use find_latest_tasks_dir() logic: latest timestamp = max() by directory name.
Error Handling — What to Ignore
The following appear constantly in run logs and are self-recovering — don't intervene:
Result failed N/6 times: validation error— model JSON parse retry, always recoversnet::ERR_CONNECTION_RESET / ERR_CONNECTION_TIMED_OUT— network blip, agent retriesCDP requests failed or timed out: ax_tree— browser DOM timeout, recovers on next stepNavigation failed: event handler timed out— slow site, agent retriesTask X timed out after 600 seconds— hit max time, still recorded as completed
Watch only for [N/N][task-id] completed or Run complete lines to know real progress.
If a task produces ERROR [Agent] ❌ Stopping due to 5 consecutive failures followed immediately by [N/N][task-id] completed, it recovered — no action needed.
Standard Task Set (50 IDs)
The default task set used in LexBench-Browser experiments:
83 85 87 89 94 124 125 128 144 148 150 163 166 172 174 175 179 180 182 183
184 187 188 189 196 197 199 205 206 208 209 210 212 213 216 217 218 221 223
227 233 236 241 255 263 276 292 294 298 304
Pass as --task-ids (space-separated). When user says "tasks=50ids" or "standard tasks", use this list.
Pre-flight Checklist
Before starting a run:
- Confirm
browser_id: Chrome-Localinconfig.yaml(or the intended browser) - Confirm
active_modelis set to the intended model name under the agent section - Or use
--model-name <name>to override without editing config - Check
output/logs/for any existing run/eval processes still active:ps aux | grep -E "run.py|eval.py" | grep -v grep