name: coding-benchmark-runner
description: Run the 15-problem Python coding & algorithmic soundness benchmark against local models on the llama.cpp router.
tags: [benchmark, coding, local-models, llama-cpp]
Coding Benchmark Runner
Run the 15-problem Python coding & algorithmic soundness benchmark against local models served via the llama.cpp router.
Location
~/llm-server/coding_test/ — contains run_bench.py, prompts.jsonl, test_cases.py, reference_solutions.py, test_harness.py
Pre-flight
cd ~/llm-server/coding_test && python3 test_harness.py
All 32 assertions must pass before running real benchmarks.
Known issues & fixes
- Python 3.14 multiprocessing: Python 3.14 defaults to the forkserver context. The harness uses mp.set_start_method("fork") in main(), but test_harness.py needed the same fix added at module level before importing run_bench. If test_harness.py fails with RuntimeError: An attempt has been made to start a new process..., add import multiprocessing as mp; mp.set_start_method("fork") at the top.
- Thinking models (GLM-4, etc.): Some models return all output in reasoning_content with empty content. call_model() in run_bench.py now falls back to reasoning_content when content is empty. If you see 0/0 scores on a model that clearly produced output, check the API response structure.
- Stdout buffering: Always run with python3 -u in background mode; otherwise output is held until the process exits and you can't monitor progress.
- Progress logging: run_bench.py now prints per-problem progress with timestamps: [wall: Xs] for total elapsed, model response time, refactor response time, and per-problem duration. All prints use flush=True. If you see no output for minutes, the model is in its thinking phase, which is normal for GLM-4 etc.
- max_tokens: Bumped to 8192 (was 2048, then 4096). The p14_simple_calculator problem (diff=4) needs ~2600+ tokens. GLM-4 at 14.5 t/s hit the 2048 limit and produced truncated syntax errors. 8192 is safe for all models.
- HTTP timeout: Auto-calculated as max_tokens * 0.12s/token + 60s margin (~1043s for 8192 tokens). The previous fixed 600s timeout caused carnice-27b to fail on p06/p07/p08: a dense 27B at 55W generating 6K+ tokens easily exceeds 600s. The 500 Internal Server Error on p14 was likely OOM or a llama.cpp bug, not a timeout issue. Do NOT hardcode timeouts.
- Incremental results saving: Results JSON is written after every problem (not just at the end). This prevents total data loss if the process is killed or crashes. The JSON includes a progress field like "8/15". If a run is interrupted, partial results are still in the output file.
- Token tracking: Each problem result now includes solve_time_s, refactor_time_s, scoring_time_s, total_time_s, solve_tokens, refactor_tokens. The stdout line also shows total tokens per problem: [94.4s, 1234 tok].
- Don't kill running processes unnecessarily: The harness only writes to JSON at the end of each problem. Killing mid-problem loses that problem's data. Only kill between problems if needed (check the JSON progress field first).
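The timeout formula and the reasoning_content fallback described above can be sketched as follows. This is a minimal illustration and not the code in run_bench.py: the function names http_timeout_s and extract_text are hypothetical, and the message layout assumes an OpenAI-style chat completion response.

```python
def http_timeout_s(max_tokens, s_per_token=0.12, margin_s=60.0):
    """Timeout budget: worst-case generation time plus a fixed margin."""
    return max_tokens * s_per_token + margin_s


def extract_text(message):
    """Return visible content, falling back to reasoning_content when empty."""
    return message.get("content") or message.get("reasoning_content") or ""


print(round(http_timeout_s(8192)))  # 1043

# A thinking model that left content empty (hypothetical response fragment).
thinking_only = {"content": "", "reasoning_content": "def solve(): ..."}
print(extract_text(thinking_only))
```

Deriving the timeout from max_tokens keeps the two settings in step, so raising the token cap does not silently reintroduce the 600s failure mode.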
Running a benchmark
cd ~/llm-server/coding_test && python3 -u run_bench.py \
--endpoint http://localhost:8080/v1 \
--model <model-name> \
--prompts prompts.jsonl \
--out results_<model>.json
- --skip-refactor halves runtime (no refactor stability metric)
- Router endpoint: http://localhost:8080/v1
- Model name must match the [section] name in ~/llm-server/router-preset.ini
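While a run started this way is in the background, the progress field of the output JSON can be polled to see how far it has got. The sketch below assumes the field sits at the top level of the results file and has the "8/15" form mentioned under Known issues; parse_progress and read_progress are hypothetical helper names.

```python
import json


def parse_progress(progress):
    """Split a progress string such as "8/15" into (done, total)."""
    done, total = progress.split("/")
    return int(done), int(total)


def read_progress(results_path):
    """Read the progress field from a (possibly partial) results file."""
    with open(results_path) as f:
        data = json.load(f)
    return parse_progress(data["progress"])


print(parse_progress("8/15"))  # (8, 15)
```

If the file happens to be read while the harness is rewriting it, json.load may raise json.JSONDecodeError; retrying a moment later is usually enough.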
Grammar-variant coding benchmark (standard vs compact CoT)
To benchmark models with AND without grammar constraints (e.g., compact CoT coding grammar vs free-form), use the combined harness:
cd ~/llm-server/coding_test && python3 run_coding_grammar_bench.py \
--endpoint http://localhost:8079 \
--models qwen36-27b-Q4_0,qwen36-27b-cot \
[--skip-refactor] \
--out results_coding_grammar.json
This runs 4 combinations (2 models × 2 grammars) sequentially and produces:
- Per-(model, grammar) summary: pass@1, complexity, numerical, refactor, overall_score, avg_tokens, tps
- Delta table: compact vs standard — token delta %, wall delta %, t/s delta %, P@1 delta, score delta
Grammars tested: standard (no grammar), compact (CoT coding grammar — enforces
Critical endpoint rule: Always use http://localhost:8079 (model-manager proxy), NOT http://localhost:8080 (raw router). The proxy handles model loading; the raw router returns 404 for unloaded models.
Model loading before the run:
# Verify current state
curl -s http://localhost:8079/proxy/status | python3 -c "import sys,json; d=json.load(sys.stdin); print('Loaded:', d.get('loaded',[]))"
# Load models (if not already loaded)
curl -s -X POST http://localhost:8079/api/load -H 'Content-Type: application/json' -d '{"model": "qwen36-27b-Q4_0"}'
curl -s -X POST http://localhost:8079/api/load -H 'Content-Type: application/json' -d '{"model": "qwen36-27b-cot"}'
If models share VRAM, load one, unload the other, load second. Monitor with curl -s http://localhost:8079/proxy/status.
Model names: Must match [section] in router-preset.ini. The -cot variants have CoT template baked into startup args; per-request grammar injection overrides this cleanly for comparison testing. qwen36-27b-Q4_K_M is the same GGUF file as qwen36-27b-cot — use whichever is uncommented in the preset.
Model management
model_manager.py is now a passive proxy (no swap subcommand). Swap models manually via curl:
# Unload current model
curl -s -X POST http://localhost:8080/models/unload -H 'Content-Type: application/json' -d '{"model":"<current-model>"}'
# Wait for memory to free
sleep 10
# Load new model
curl -s -X POST http://localhost:8080/models/load -H 'Content-Type: application/json' -d '{"model":"<new-model>"}'
Do NOT unload a model during a running bench — it kills the model mid-problem and produces garbage results (500 errors for remaining problems).
Thinking Mode Testing
For models with configurable thinking mode (enable_thinking in chat-template-kwargs in router-preset.ini), always benchmark both modes. The effect is model-dependent and unpredictable:
- qwen36-35b (MoE 35B/3B): Thinking OFF is strictly better — +0.057 coding, +0.095 JSON, 3x faster. Thinking tokens introduced noise.
- holo3-35b (MoE 35B/3B): Thinking ON is critical — -0.122 coding without it, numerical stability collapsed from 1.00 to 0.67.
- gemma4-31b (dense 31B): Thinking ON produced reasoning_content; OFF produced direct content. Impact TBD.
Same architecture class (Qwen3.5 35B-A3B MoE), opposite outcomes. You cannot predict which mode is better without benchmarking.
Procedure:
- Set chat-template-kwargs = {"enable_thinking":false} in router-preset.ini (for models using this mechanism)
- systemctl --user restart m5-router.service
- Run JSON bench + coding bench
- Set chat-template-kwargs = {"enable_thinking":true} (or remove the line if that's the default)
- Restart router, run both benches again
- Compare and record both rows in merged_bench_results.md with a Think column (yes/no/—)
Record both configurations in the main results table as separate rows (e.g., qwen36-35b with Think=no, qwen36-35b with Think=yes).
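When comparing the two rows, the per-metric difference is the number that matters. A small sketch of that comparison, using the holo3-35b values from the results table in this document; the dictionary keys and the think_delta helper are illustrative and not the harness's actual output schema.

```python
METRICS = ("pass_at_1", "numerical", "overall")


def think_delta(think_on, think_off):
    """Per-metric change when thinking is switched off (off minus on)."""
    return {m: round(think_off[m] - think_on[m], 3) for m in METRICS}


# holo3-35b rows: Think=yes vs Think=no
holo3_on = {"pass_at_1": 1.00, "numerical": 1.00, "overall": 0.920}
holo3_off = {"pass_at_1": 0.87, "numerical": 0.67, "overall": 0.798}

print(think_delta(holo3_on, holo3_off))
```

A negative delta means the model got worse with thinking disabled, which is the holo3-35b case; a positive delta would match the qwen36-35b behavior.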
Toggling reasoning mode via reasoning key (Gemma, etc.)
Some models (Gemma 4) don't use chat-template-kwargs for thinking. Instead, use the reasoning key in router-preset.ini:
# Disable thinking:
reasoning = off
# Enable thinking:
reasoning = auto
Critical: reasoning = none is INVALID. The valid values are on, off, auto. Using none is silently ignored and falls back to the parent default.
Critical: parent router --reasoning flag overrides per-model INI. If start-native-router.sh has --reasoning auto, per-model reasoning = off in the INI is ignored. Fix: remove --reasoning from start-native-router.sh entirely, then add reasoning = auto to every model section in the INI. Only the model being tested gets reasoning = off.
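Because an invalid value is silently ignored, it can help to lint the preset before restarting the router. The following is a minimal sketch that assumes router-preset.ini is plain enough for Python's configparser to read (an assumption, not something the router guarantees); invalid_reasoning_values is a hypothetical helper.

```python
import configparser

VALID_REASONING = {"on", "off", "auto"}


def invalid_reasoning_values(ini_text):
    """Return (section, value) pairs whose reasoning key is not on/off/auto."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    bad = []
    for section in parser.sections():
        value = parser.get(section, "reasoning", fallback=None)
        if value is not None and value.strip() not in VALID_REASONING:
            bad.append((section, value))
    return bad


# Illustrative preset fragment, not the real file.
SAMPLE = """
[gemma4-31b]
reasoning = off

[gemma4-26b-moe]
reasoning = none
"""

print(invalid_reasoning_values(SAMPLE))  # [('gemma4-26b-moe', 'none')]
```

This only catches the invalid-value case; it cannot detect a --reasoning flag on the parent router overriding the INI.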
Verification: After restarting and loading the model, check the child process args:
# Check loaded model's reasoning flag
curl -s http://localhost:8080/v1/models | python3 -c "
import sys,json
d=json.load(sys.stdin)
for m in d.get('data',[]):
    if m['id']=='MODEL_NAME':
        args=m.get('status',{}).get('args',[])
        if '--reasoning' in args:
            idx=args.index('--reasoning')
            print(f'reasoning={args[idx+1]}')
"
# Or quick test:
curl -s http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"MODEL","messages":[{"role":"user","content":"2+2?"}],"max_tokens":50}' | \
python3 -c "import sys,json; m=json.load(sys.stdin)['choices'][0]['message']; print('thinking:', bool(m.get('reasoning_content','')))"
Post-benchmark variant pruning
After benchmarking multiple GGUF variants of the same model family (e.g., Qwen 3.6 27B Q4_0 vs Q4_K_M), compare results and prune the router preset to retain only the best performing variant.
Comparison metrics
Prioritize these metrics in order:
- pass@1 (correctness, weight 0.45 in benchmark)
- overall_score (weighted sum of all metrics)
- t/s (tokens per second, runtime efficiency)
- avg_tokens (token usage per problem, lower is better for speed/cost)
Pruning steps
- Open ~/llm-server/router-preset.ini
- Identify all [section] entries for the model family (e.g., qwen36-27b-Q4_0, qwen36-27b-Q4_K_M, qwen36-27b-cot)
- Comment out or delete sections for underperforming variants (retain only the best)
- Save the INI file
- Restart the router service to apply changes: systemctl --user restart m5-router.service
- Verify the proxy (port 8079) reflects updates:
curl -s http://localhost:8079/proxy/status | python3 -c "import sys,json; d=json.load(sys.stdin); print('Models configured:', len(d.get('models',[]))); [print(f' {m[\"name\"]} (loaded: {m[\"loaded\"]})') for m in d.get('models',[]) if '<model-family>' in m['name'].lower()]"
Example: Qwen 3.6 27B pruning
Benchmark results:
- qwen36-27b-Q4_0: P@1=85.7%, Score=0.85, 12 t/s, 634 tok/problem
- qwen36-27b-Q4_K_M: P@1=85.7%, Score=0.82, 9 t/s, 1215 tok/problem
Pruning action: Retain qwen36-27b-Q4_0, remove all other 27B variants from the preset.
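One way to apply the metric priority mechanically is a lexicographic sort: compare pass@1 first and fall through to the next metric only on a tie. This is a sketch of that reading of "in order", not a rule the benchmark itself enforces; rank_variants and the dictionary keys are illustrative names.

```python
def rank_variants(results):
    """Sort best-first: pass@1, overall_score, t/s (higher wins), then avg_tokens (lower wins)."""
    return sorted(
        results,
        key=lambda r: (-r["pass_at_1"], -r["overall_score"], -r["tps"], r["avg_tokens"]),
    )


# Values from the Qwen 3.6 27B example above.
variants = [
    {"name": "qwen36-27b-Q4_K_M", "pass_at_1": 0.857, "overall_score": 0.82, "tps": 9, "avg_tokens": 1215},
    {"name": "qwen36-27b-Q4_0", "pass_at_1": 0.857, "overall_score": 0.85, "tps": 12, "avg_tokens": 634},
]

best = rank_variants(variants)[0]
print(best["name"])  # qwen36-27b-Q4_0
```

Here pass@1 ties at 85.7%, so overall_score decides, which agrees with the pruning action.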
Notes
- The router caches router-preset.ini at startup; restart is mandatory for changes to take effect.
- Re-run benchmarks if adding new variants of the same model family later.
Cross-Model Benchmark Comparison
Use this workflow when comparing benchmark results from different model families (e.g., Qwen 3.6 27B vs Qwopus35 27B) to select the best model for a specific use case, not just variants of the same family.
When to use
- Comparing 2+ models from different architectures/families (not just quant variants of the same base model)
- Selecting a primary model for a recurring task (e.g., coding pipeline, research agent)
- Evaluating tradeoffs between accuracy, speed, token efficiency, and stability
Parse heterogeneous benchmark JSON structures
Different benchmark scripts produce different JSON schemas. Common patterns:
- Grammar bench (Qwen 3.6 27B Q4_0 style): Nested summary under summary.<model_grammar_label> with keys pass_at_1, overall_score, avg_tokens, avg_tps
- Flat bench (Qwopus35 27B style): Flat summary object with pass_at_1, overall_score; token/speed data must be calculated from the problems array:

import json

with open('results_qwopus35_27b.json') as f:
    data = json.load(f)

problems = data["problems"]
total_tokens = sum(p.get("solve_tokens", 0) for p in problems)
total_time = sum(p.get("solve_time_s", 0) for p in problems)
avg_tokens = total_tokens / len(problems)
avg_tps = total_tokens / total_time if total_time > 0 else 0
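To compare the two schemas side by side, both can be mapped to one shape. The sketch below assumes the flat schema is recognizable by pass_at_1 sitting directly under summary; normalize_summary and the label argument are hypothetical, and the sample values are synthetic.

```python
KEYS = ("pass_at_1", "overall_score", "avg_tokens", "avg_tps")


def normalize_summary(data, label="model"):
    """Map either results schema to {label: {pass_at_1, overall_score, avg_tokens, avg_tps}}."""
    summary = data["summary"]
    if "pass_at_1" in summary:
        # Flat schema: derive token and speed figures from the problems array.
        problems = data.get("problems", [])
        total_tokens = sum(p.get("solve_tokens", 0) for p in problems)
        total_time = sum(p.get("solve_time_s", 0) for p in problems)
        return {label: {
            "pass_at_1": summary["pass_at_1"],
            "overall_score": summary["overall_score"],
            "avg_tokens": total_tokens / len(problems) if problems else 0.0,
            "avg_tps": total_tokens / total_time if total_time > 0 else 0.0,
        }}
    # Nested schema: one entry per <model_grammar_label>.
    return {name: {k: entry.get(k) for k in KEYS} for name, entry in summary.items()}


# Synthetic flat-schema input, for illustration only.
flat = {
    "summary": {"pass_at_1": 0.9, "overall_score": 0.8},
    "problems": [
        {"solve_tokens": 100, "solve_time_s": 10},
        {"solve_tokens": 300, "solve_time_s": 30},
    ],
}
print(normalize_summary(flat, label="flat-model"))
```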
Metrics priority (cross-model)
Same as variant pruning, but add refactor_stability earlier for pipeline use cases:
- pass@1 (correctness, weight 0.45)
- refactor_stability (iterative coding stability, critical for feature pipelines)
- overall_score (weighted sum of all metrics)
- t/s (tokens per second, runtime efficiency)
- avg_tokens (token usage per problem, lower is better for speed/cost)
Conclusion framework
Draw conclusions based on use case:
- Recurring pipeline task (e.g., feature implementation): Prioritize refactor_stability > t/s > avg_tokens > pass@1. Verbose models with high token usage are penalized.
- One-off hard problem: Prioritize peak pass@1 > overall_score. Speed/token usage less critical.
- Research/reasoning task: Check if CoT variant exists, prioritize traceable reasoning over raw coding metrics.
Example: Qwen 3.6 27B Q4_0 vs Qwopus35 27B
| Metric | Qwen 3.6 27B Q4_0 | Qwopus35 27B (best) |
|---|---|---|
| Pass@1 | 85.71% | 93.33% |
| Overall Score | 0.8524 | 0.8700 |
| Avg Tokens/Problem | 634.1 | 2043.5 |
| Avg t/s | 12.1 | 11.6 |
| Refactor Stability | 83.3% | 71.4% |
Conclusion: Qwen is better for recurring coding pipelines (faster, more token-efficient, more refactor-stable). Qwopus is better for one-off hard problems with higher peak accuracy.
Pitfalls
- Never run without the -u flag: Python buffers stdout when redirected; you'll see zero output for the entire run until it finishes. NOTE: even with -u, some models (e.g. carnice-35b thinking MoE) produce zero stdout while actively running. Always check the JSON output file's progress field to verify the bench is advancing; it's the only reliable progress indicator.
- Don't restart runs unless necessary: each restart costs 30-60 min. The incremental JSON save means you can check progress to know where it is. If you must restart, at least the partial results are preserved.
- Model swap time: the router needs to unload the current model and load the new one. Expect 30-90s of silence after starting a new model's run. Don't panic, just wait.
- Thinking models (GLM-4, Qwen3.5, Step-3.5-Flash): they generate all reasoning tokens before emitting any visible output. A problem can show no progress for 3-5 minutes even though the model is actively generating. Step-3.5-Flash generates 4400 tokens per solve, taking ~4 min solve + ~4 min refactor = ~8 min per problem (2 hours total).
- Process death diagnosis: if a bench process dies silently (no crash output, process gone), check: (a) the results JSON for progress, (b) dmesg for OOM kills, (c) test the model manually with curl to see if it's the model or the harness. Common causes: HTTP timeout too short, router 500 error (llama.cpp crash on large prompts), or OOM.
- ornsteinV-27b dies at 1/15: known issue where the process exits silently after scoring p01. Root cause unclear (not OOM, not timeout). May need investigation. Model works fine with manual curl requests.
- step35-flash / step37 (Step 3.x Flash models): these are thinking models by nature; they output to reasoning_content not content, even when chat-template-kwargs = {"enable_thinking":false} is set in the router preset. The enable_thinking flag is silently ignored by some templates: the model continues producing thinking tokens and leaving content empty. The harness fallback (content or reasoning_content) handles this. step35-flash is extremely slow (18 t/s, spills to RAM on Strix Halo). Needs 2+ hours for full run. Use 10800s timeout minimum. Its refactors frequently hit the 8192 token cap, producing truncated syntax errors and tanking refactor_stability to 0.33. step37 at IQ4_XS (95GB, mmap) is also slow, but thinking bleed-through can be fixed via chat template override; see the step37-thinking-suppression skill. After the fix, pass@1 jumps from 60% to 87% with zero syntax errors.
- Never unload a model during a running bench: calling /models/unload or model_manager.py swap while a bench is running kills the model mid-problem, causing 500 errors for all remaining problems. The old results file will contain garbage partial data. Always wait for the bench to complete first.
- Router restart needed for INI changes: ANY change to router-preset.ini (new model, changed params) requires systemctl --user restart m5-router.service. The router caches the entire preset at startup. Changing no-mmap to mmap or ctx-size without restart means the old params are still used silently.
- Always benchmark thinking models in both modes: thinking mode effect is model-dependent. qwen36-35b (Qwen3.6 MoE) improved significantly WITHOUT thinking: +0.057 coding, +0.095 JSON, 3x faster. Conversely, holo3-35b (same base arch, Qwen3.5 MoE) degraded severely without thinking: -0.122 coding, -0.333 numerical. Always test enable_thinking:true AND enable_thinking:false before settling on a config. Toggle via chat-template-kwargs = {"enable_thinking":true/false} in router-preset.ini.
- model_manager.py swap memory wait: after unloading a large model (100 GB+), memory takes time to free. The swap command polls /proc/meminfo for up to 60s. Don't use a fixed 2s delay or it will abort on false "not enough memory". Use system MemAvailable (not VRAM/GTT) on Strix Halo; unified memory means the VRAM/GTT split is meaningless.
- Sandbox __build_class__ and __name__ errors: models' generated code may reference __build_class__ (metaclasses) or __name__ (module-level checks), which aren't in the default exec namespace. This causes NameError: __build_class__ not found or NameError: name '__name__' is not defined and 0/3 correctness.
  - Fix for __build_class__: add "__build_class__" to the safe_builtins dict
  - Fix for __name__: initialize the ns dict with "__name__": "__main__" and "__doc__": None (NOT in safe_builtins). The exec namespace needs these defined before running user code.
  - Apply to both run_bench.py and run_coding_grammar_bench.py.
- Batch script model swap timing: when running multiple models sequentially in a batch script, the router needs adequate time between unload and load. With only 10s sleep, models fail to load (503 Service Unavailable on all subsequent requests). Use 15s post-unload + 20s post-load + readiness verification loop (ping /v1/models up to 10 times with 5s intervals). Without this, the entire bench run produces 0.0 scores on every problem.
- Router 10s force-kill on large models: when loading a model after unloading another, the router force-kills the new spawn after 10 seconds if the old instance hasn't fully shut down. For 100GB+ models, cleanup takes minutes. Always unload, wait 15-30s, then load. If you see "force-killing model instance after 10 seconds timeout" in journalctl, this is the cause.
- Models >100GB need mmap=true: on 128GB Strix Halo, models >100GB fail with Vulkan ErrorOutOfHostMemory when using no-mmap = true. Switch to mmap = true and reduce ctx-size (e.g., 32768 instead of 131072) to fit. The model loads from disk on-demand instead of preloading all tensors into VRAM.
- APU free memory is unreliably reported: free -h and /proc/meminfo MemAvailable under-report on Strix Halo APU. Don't rely on them for capacity decisions. If the model fits on paper (~103GB model vs 128GB total), try loading it; the reported 22GB "available" was misleading.
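The sandbox namespace fix from the list above can be sketched in isolation. This is a simplified stand-in for the harness's sandbox, not its actual code: the allowed builtin names and the make_namespace helper are assumptions chosen for the demo.

```python
import builtins

# Illustrative allow-list; the real safe_builtins dict is larger.
SAFE_BUILTIN_NAMES = ("len", "range", "print", "isinstance", "int", "__build_class__")


def make_namespace():
    """Exec namespace with restricted builtins plus the module-level names classes need."""
    safe_builtins = {name: getattr(builtins, name) for name in SAFE_BUILTIN_NAMES}
    return {"__builtins__": safe_builtins, "__name__": "__main__", "__doc__": None}


CANDIDATE = '''
class Stack:
    def __init__(self):
        self.items = []

    def push(self, x):
        self.items.append(x)
        return len(self.items)

if __name__ == "__main__":
    RESULT = Stack().push(42)
'''

ns = make_namespace()
exec(CANDIDATE, ns)
print(ns["RESULT"])  # 1

# The same code in a bare namespace fails with NameError.
try:
    exec(CANDIDATE, {"__builtins__": {}})
except NameError as exc:
    print("bare namespace fails:", exc)
```

The class statement needs __build_class__ from the builtins, while __name__ is looked up in the globals dict, which is why the two fixes go in different places.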
Reference material
- references/step37-bench-results.md: step37 (Step-3.7-Flash IQ4_XS) partial results and observations about thinking-mode bleed-through.
Active models (uncommented in router-preset.ini)
- qwen35-9b, qwen36-27b (MTP), qwen36-35b (MTP)
- qwen35-122b, qwopus35-27b, carnice-27b, ornsteinV-27b, harmonic-27b
- holo3-35b, qwopus-moe-35b, carnice-35b
- nemotron-120b, nemotron-cascade2-30b
- glm47-flash, minimax25, mistral4-small-119b, step35-flash
- gemma4-31b, gemma4-26b-moe
- step37 (Step-3.7-Flash IQ4_XS)
New models must be added to router-preset.ini AND the router service restarted (systemctl --user restart m5-router.service). Just adding to INI is not enough — the router returns 404 on /models/load for unknown models.
Results (55W TDP, with refactor, 15-problem bench)
| Model | Quant | pass@1 | complexity | numerical | refactor | overall | Time |
|---|---|---|---|---|---|---|---|
| nemotron-120b | Q4_K_M | 1.00 | 0.80 | 1.00 | 0.87 | 0.933 | 18 min |
| holo3-35b | Q8_0 | 1.00 | 0.73 | 1.00 | 0.87 | 0.920 | 5 min |
| holo3-35b (no-think) | Q8_0 | 0.87 | 0.85 | 0.67 | 0.69 | 0.798 | 1.8 min |
| cascade2-30b | Q4_K_M | 0.93 | 0.79 | 1.00 | 0.86 | 0.899 | 10 min |
| qwopus-moe-35b | Q4_K_M | 0.87 | 0.85 | 1.00 | 0.85 | 0.878 | 6 min |
| qwopus-moe-35b | Q8_0 | 0.93 | 0.64 | 1.00 | 0.86 | 0.870 | 8 min |
| qwopus35-27b | Q4_K_M | 0.93 | 0.79 | 1.00 | 0.71 | 0.870 | — |
| mistral4-119b | Q4_K_M | 0.80 | 0.83 | 1.00 | 0.67 | 0.810 | 4 min |
| qwen35-122b | Q4_K_M | 0.87 | 0.69 | 1.00 | 0.85 | 0.848 | — |
| harmonic-27b | Q4_K_M | 0.87 | 0.77 | 1.00 | 0.69 | 0.832 | 67 min |
| carnice-35b | Q8_0 | 0.93 | 0.79 | 1.00 | 0.86 | 0.899 | 14 min |
| mistral4-small-119b | Q4_K_M | 0.80 | 0.83 | 1.00 | 0.67 | 0.810 | 4 min |
| glm47-flash | Q8_K_XL | 0.80 | 0.58 | 1.00 | 0.83 | 0.793 | — |
| carnice-27b | Q4_K_M | 0.67 | 0.80 | 1.00 | 0.80 | 0.770 | — |
| gemma4-31b | Q4_K_M | 1.00 | 0.67 | 1.00 | 0.73 | 0.880 | 70 min |
| gemma4-31b (no-think) | Q4_K_M | 1.00 | 0.67 | 1.00 | 0.80 | 0.893 | — |
| gemma4-26b-moe | Q8_0 | 1.00 | 0.73 | 1.00 | 0.67 | 0.880 | 76 min |
| minimax25 | IQ4_XS | 0.93 | 0.64 | 1.00 | 0.64 | 0.827 | 78 min |
| qwen36-35b | Q8_0 | 1.00 | 0.80 | 1.00 | 0.80 | 0.920 | 3.5 min |
| qwen36-27b | Q4_K_M (MTP) | 0.93 | 0.86 | 1.00 | 0.79 | 0.899 | 7 min |
| step35-flash | IQ4_XS | 0.80 | 0.75 | 1.00 | 0.33 | 0.727 | 127 min |
| step37 (thinking corrupt) | IQ4_XS | 0.60 | 0.78 | 0.67 | 0.56 | 0.637 | 155 min |
| step37 (template fix) | IQ4_XS | 0.87 | 0.69 | 1.00 | 0.39 | 0.755 | 60 min |
| ornsteinV-27b | Q4_K_M | 0.80* | 0.75* | 1.00* | 1.00* | 0.71* | partial 5/15 |
Notes:
- Thinking models (harmonic, step35, step37, minimax, GLM-4) are slow — 100-250s per solve.
- step35-flash and step37 hit 8192 token cap on refactors => truncated syntax errors, worst refactor scores. step37 with default template produced Unicode artifacts (U+2014 em-dash, curly quotes) due to thinking bleed-through. Template fix resolved this: after chat template override, pass@1 went from 60% to 87%, numerical from 67% to 100%, zero syntax errors. See step37-thinking-suppression skill.
- step37's JSON bench is perfect (1.0) despite thinking issues — structured output unaffected.
- Q8_0 qwopus-moe gained pass@1 but lost complexity vs Q4_K_M — net zero.
- holo3-35b best price/performance: perfect pass@1 in 5 minutes at Q8_0.
- carnice-35b rerun (model config changed): pass@1 jumped 0.73→0.93, overall 0.826→0.899. Still hits 8192 tok cap on p14 refactor.
- mistral4-small-119b rerun at Q4_K_M: pass@1 dropped 0.93→0.80, complexity improved 0.79→0.83. Overall 0.856→0.810.
Quantization fairness note: holo3-35b (Q8_0, 34.4 GB) vs qwopus-moe-35b (Q4_K_M, 19.7 GB) is not a fair comparison. Re-bench qwopus-moe at Q8_0 for apples-to-apples.
Timing expectations
At 55W TDP, expect 5-60 min per model with refactor enabled. Small models (glm47-flash 9B) are ~14.5 t/s; large MoE models are slower. Thinking models (step35-flash, minimax25) can take 2+ hours. Fastest runs in the results table: holo3-35b (no-think) at 1.8 min and qwen36-35b at 3.5 min; holo3-35b with thinking takes ~5 min.
Metrics
- pass_at_1: correctness on small + edge cases (weight 0.45)
- complexity_match_rate: empirical O() matches expected via log-log curve fitting (weight 0.20)
- numerical_stability: catastrophic cancellation, overflow detection (weight 0.15)
- refactor_stability: both original and refactored versions pass (weight 0.20)
- overall_score: weighted sum
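For intuition on the complexity metric: on a log-log plot, runtime t ~ c * n^k becomes a straight line with slope k, so a least-squares slope estimates the empirical exponent. The sketch below shows that idea only; how the harness actually fits and compares curves is not documented here, and loglog_slope is a hypothetical name.

```python
import math


def loglog_slope(sizes, times):
    """Least-squares slope of log(time) against log(size), the exponent k in t ~ n^k."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


sizes = [1_000, 2_000, 4_000, 8_000]
quadratic_times = [3e-9 * n * n for n in sizes]  # synthetic timings, not measured
print(round(loglog_slope(sizes, quadratic_times), 2))  # 2.0
```

Real timings are noisy, and O(n log n) yields a slope slightly above 1, so any match against an expected class needs a tolerance band.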
GO thresholds: pass_at_1 >= 0.80, complexity >= 0.70, numerical >= 0.67, refactor >= 0.70, overall >= 0.75
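The weighted sum and the GO check can be reproduced by hand from the weights and thresholds above. A minimal sketch using two rows from the results table; the short metric keys and the overall_score and is_go helpers are illustrative names.

```python
WEIGHTS = {"pass_at_1": 0.45, "complexity": 0.20, "numerical": 0.15, "refactor": 0.20}
GO_THRESHOLDS = {"pass_at_1": 0.80, "complexity": 0.70, "numerical": 0.67,
                 "refactor": 0.70, "overall": 0.75}


def overall_score(metrics):
    """Weighted sum of the four component metrics."""
    return sum(WEIGHTS[k] * metrics[k] for k in WEIGHTS)


def is_go(metrics):
    """True only if every metric, including the overall score, meets its threshold."""
    scored = dict(metrics, overall=overall_score(metrics))
    return all(scored[k] >= GO_THRESHOLDS[k] for k in GO_THRESHOLDS)


qwen36_35b = {"pass_at_1": 1.00, "complexity": 0.80, "numerical": 1.00, "refactor": 0.80}
carnice_27b = {"pass_at_1": 0.67, "complexity": 0.80, "numerical": 1.00, "refactor": 0.80}

print(round(overall_score(qwen36_35b), 3))   # 0.92
print(is_go(qwen36_35b), is_go(carnice_27b))  # True False
```

Table values are rounded to two decimals, so a recomputed overall score can differ from the reported one in the third decimal place.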
Batch run strategy
Run models sequentially (router can only serve one at a time). Use background mode with notify_on_complete. Script to run all:
for model in glm47-flash qwopus35-27b carnice-27b ...; do
python3 -u run_bench.py --endpoint http://localhost:8080/v1 \
--model $model --out results_${model}.json
done
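The loop above does not include the readiness check recommended under Pitfalls (up to 10 pings of /v1/models at 5s intervals). A sketch of that check follows, with some assumptions: it treats "model id appears in the /v1/models list" as ready, whereas the real router may expose a more precise status field, and model_listed and wait_until_ready are hypothetical helpers.

```python
import json
import time
import urllib.request


def model_listed(base_url, model, timeout_s=10):
    """True if GET {base_url}/v1/models answers and lists the model id."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout_s) as resp:
            data = json.load(resp)
    except (OSError, ValueError):
        return False
    return any(m.get("id") == model for m in data.get("data", []))


def wait_until_ready(probe, attempts=10, interval_s=5, sleep=time.sleep):
    """Call probe() up to `attempts` times, sleeping between failed tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(interval_s)
    return False


# Real use (not executed here):
#   wait_until_ready(lambda: model_listed("http://localhost:8080", "glm47-flash"))

# Offline demo with a fake probe that succeeds on the third call.
calls = []
def fake_probe():
    calls.append(1)
    return len(calls) >= 3

print(wait_until_ready(fake_probe, sleep=lambda s: None), len(calls))  # True 3
```

Passing the probe and the sleep function in as arguments keeps the retry logic testable without a running router.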
CRITICAL — Cross-benchmark parallelism corrupts results. Do NOT run JSON bench, coding bench, and research bench in parallel against the same router — the model can only serve one request stream at a time. Concurrent API calls cause contention, request queuing, timeout cascades, and noisy scores. Run them strictly sequentially in this order: JSON first (fastest, 5-20 min), then coding (5-120 min), then research (~5-60 min, depends on model). If launched in parallel by mistake, kill all processes first, clean up partial results (rm results_*.json bench_*.log), verify the model is idle (curl -s http://localhost:8080/v1/models | grep loaded), then restart serially. This is the only reliable pattern for clean per-benchmark results.