category: mlops
name: router-troubleshooting
description: Fix m5-router.service issues - orphaned INI entries, rogue llama-server processes blocking port 8080
Router Troubleshooting
When the router won't start or the GUI shows no models
Quick diagnostics
# Check service status
systemctl --user status m5-router.service
# Find what's on port 8080
lsof -i :8080
# Also check router port 8079 for alternative config
lsof -i :8079
Common issues & fixes
1. Unrecognized INI options (most common crash cause)
The llama.cpp router strictly validates preset options. Any key = value line with an unknown key causes an immediate crash with:
main: failed to initialize router models: option 'X' not recognized in preset 'Y'
This puts the service in a crash loop (systemd restarts it repeatedly).
How to diagnose:
journalctl --user -u m5-router.service --since "10 min ago" --no-pager | grep "not recognized"
Valid preset options (as of llama.cpp b8920): model, ctx-size, cache-type-k, cache-type-v, n-gpu-layers, n-gpu-layers-draft, flash-attn, no-mmap, mmap, jinja, chat-template-kwargs, chat-template-file, temp, top-p, top-k, min-p, repeat-penalty, presence-penalty, batch-size, ubatch-size, threads, mmproj, kv-unified, no-warmup, load-on-startup, draft-max, draft-min, draft-p-min, model-draft, cache-type-k-draft, cache-type-v-draft, rope-scale, rope-freq-base, reasoning, reasoning-budget.
⚠️ Custom/approximate options like max-reasoning-tokens, draft-context are NOT valid and will crash the router. Any dynamic tuning must be done in the gateway proxy (model_manager.py on :8079), not in the preset INI.
⚠️ start-native-router.sh validator is stricter than llama-server: The script has a hardcoded KNOWN_KEYS list that can lag behind llama-server's actual supported flags. If validation fails with "Unknown preset keys" but the key is valid (check llama-server --help), add it to KNOWN_KEYS in the script. Example: n-gpu-layers-draft was missing despite being a valid flag.
Specific pitfall — reasoning options
The router uses reasoning and reasoning-budget to control CoT output. Common incorrect variants:
- max-reasoning-tokens → use reasoning-budget
- draft-context → use draft-max and draft-min for speculative decoding

If you're copying options from other contexts (OpenWebUI, llama-server CLI), double-check they match the preset schema above.
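As a rough pre-flight check, the preset file can be linted against the list of valid options before the router is restarted. The sketch below is only an illustration: lint_preset_keys and VALID_KEYS are hypothetical names, the key list is copied from the b8920 list above and may lag behind newer llama.cpp builds, and it assumes presets are grouped under [section] headers with # or ; comments.

```shell
# Keys copied from the "Valid preset options" list above (llama.cpp b8920).
# Assumption: this list is complete for the build in use.
VALID_KEYS="model ctx-size cache-type-k cache-type-v n-gpu-layers n-gpu-layers-draft flash-attn no-mmap mmap jinja chat-template-kwargs chat-template-file temp top-p top-k min-p repeat-penalty presence-penalty batch-size ubatch-size threads mmproj kv-unified no-warmup load-on-startup draft-max draft-min draft-p-min model-draft cache-type-k-draft cache-type-v-draft rope-scale rope-freq-base reasoning reasoning-budget"

# Hypothetical helper: reads an INI on stdin and prints "line N: key" for
# every key that is not in VALID_KEYS.
lint_preset_keys() {
    awk -v valid="$VALID_KEYS" '
        BEGIN { n = split(valid, k, " "); for (i = 1; i <= n; i++) ok[k[i]] = 1 }
        /^[ \t]*$/  { next }             # blank line
        /^[ \t]*#/  { next }             # "#" comment
        /^[ \t]*;/  { next }             # ";" comment (assumed to be allowed)
        /^[ \t]*\[/ { next }             # [section] header
        index($0, "=") == 0 { next }     # orphan lines are a separate problem
        {
            key = $0
            sub(/=.*/, "", key)          # keep the text before the first "="
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            if (!(key in ok)) print "line " NR ": " key
        }
    '
}

# Usage (not run here):
#   lint_preset_keys < /home/cricri/llm-server/router-preset.ini
```

No output means every key is in the list; llama-server --help stays the authority when the list and the binary disagree.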
2. Orphaned INI entries
The router-preset.ini can accumulate bare filenames without key = value pairs.
# Check for orphan lines (skips comments, [section] headers and blank lines, which also lack "=")
grep -vE '^[[:space:]]*(#|;|\[|$)' /home/cricri/llm-server/router-preset.ini | grep -v '=' | head
Fix: Remove any line that's just a filename (no equals sign). The router only recognizes key = value format.
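The removal can be done non-destructively by filtering the file into a copy and reviewing the diff before swapping it in. This is a minimal sketch, not the project's own tooling: strip_orphans is a hypothetical helper, and it assumes that blank lines, # or ; comments and [section] headers are the only legitimate lines without an equals sign.

```shell
# Hypothetical helper: copies an INI from stdin to stdout and drops orphan
# lines (non-blank, not a comment, not a [section] header, and no "=").
strip_orphans() {
    awk '
        /^[ \t]*$/  { print; next }      # keep blank lines
        /^[ \t]*#/  { print; next }      # keep "#" comments
        /^[ \t]*;/  { print; next }      # keep ";" comments
        /^[ \t]*\[/ { print; next }      # keep [section] headers
        index($0, "=") > 0 { print }     # keep key = value lines
    '
}

# Usage (not run here):
#   ini=/home/cricri/llm-server/router-preset.ini
#   strip_orphans < "$ini" > "$ini.new"
#   diff "$ini" "$ini.new"       # review what would be removed
#   mv "$ini.new" "$ini"
```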
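3. Rogue llama-server process hogging port 8080
A standalone llama-server instance can hold port 8080 and block the router from binding.
# Kill all llama-server processes, then free the port.
# ";" rather than "&&": pkill exits non-zero when nothing matches, which would skip the port cleanup.
pkill -f "llama-server"; lsof -ti:8080 | xargs -r kill -9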
4. Port 8079 conflict (secondary router)
Sometimes a secondary router instance runs on 8079, causing duplicate model loads.
# Identify and kill rogue processes
lsof -ti:8079 | xargs -r kill -9
Full recovery sequence
# 1. Stop everything
systemctl --user stop m5-router.service
lsof -ti:8080 | xargs -r kill -9
lsof -ti:8079 | xargs -r kill -9
# 2. Fix INI if needed (see above)
# 3. Start fresh
systemctl --user start m5-router.service
sleep 5
systemctl --user status m5-router.service
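A fixed sleep 5 can be too short when models load on startup. One alternative is to poll until the router answers, sketched below. wait_for_health is a hypothetical helper; llama-server exposes a /health endpoint, but that the router on port 8080 serves it in this setup is an assumption.

```shell
# Hypothetical helper: poll a URL until it answers with an HTTP success
# status, or give up after a number of seconds (default 30).
# Returns 0 when the URL responds, 1 on timeout.
wait_for_health() {
    url=$1
    timeout=${2:-30}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # -f: treat HTTP error codes as failure; --max-time bounds each attempt
        if curl -sf --max-time 2 -o /dev/null "$url"; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

# Usage (not run here):
#   systemctl --user start m5-router.service
#   wait_for_health http://127.0.0.1:8080/health 60 || echo "router did not come up"
```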
Model backend stuck in thinking loop
Models with thinking mode (e.g., qwen36-35b with enable_thinking: true) can get stuck — all slots show is_processing: true with ~190 tokens decoded but ~9800 remaining. The model appears "loaded" but requests hang.
Model backend stuck after large-context request (silent hang)
A failure mode distinct from the thinking loop, caused by a sync request with a long prompt context (3000+ prompt tokens). The model generates at ~20 t/s but takes minutes, saturating the GPU at 100%. Model-manager's retry logic compounds the problem.
Root cause — OpenWebUI "Suggested Questions" feature:
After every assistant response, OpenWebUI's frontend automatically sends a sync request to generate 3-4 follow-up questions using the full conversation history as prompt (typically 3000-5000+ prompt tokens). This is a client-side setting stored in browser localStorage:
- OpenWebUI UI: Settings → Interface → "Show suggested questions after chat response"
- No server-side config toggle — must be unchecked in each browser
When disabled, OpenWebUI stops sending these post-response requests entirely.
Symptom sequence:
- User's task completes fine (stream or sync)
- OpenWebUI frontend sends a new sync request with the full conversation history (~3455 prompt tokens) for follow-up question generation
- Backend starts a massive generation: slot shows n_decoded climbing past 5000-8000 tokens
- GPU pegs at 100%, request takes 2+ minutes
- Model-manager hits "Proxy error: timed out" after 2 minutes
- Model-manager enters a retry loop: sends a new request every ~5 seconds, all pile up behind the stuck slot
- After 6+ minutes, model-manager logs "Unloading X for swap → Y" and force-kills the backend
- Router logs "force-killing model instance name=X after 10 seconds timeout" then "exited with status 1"
Diagnosis — trace the chain:
# 1. Check model-manager for the timeout + retry + swap pattern
journalctl --user -u model-manager.service --since "30 min ago" --no-pager | grep -E "Proxy error|timed out|Unloading"
# 2. Check model-manager for request activity — look for "→ model sync" followed by timeout
# The sync after a "◆ model reasoning:" line is the follow-up question request
journalctl --user -u model-manager.service --since "30 min ago" --no-pager | grep -v "State refreshed"
# 3. Check router for proxy_reques frequency (5s intervals = retry loop, 50-100+ = stuck)
journalctl --user -u m5-router.service --since "30 min ago" --no-pager | grep "proxy_reques" | grep -o "model [a-z0-9-]*" | sort | uniq -c | sort -rn
# 4. Check backend slots for the stuck task
# Find the backend port:
ps aux | grep llama-server | grep "alias <model-name>"
# Check slots — look for is_processing=true with large n_decoded:
curl -s http://<backend-port>/slots | python3 -c "import json,sys; d=json.load(sys.stdin); [print(f'slot {s[\"id\"]}: processing={s[\"is_processing\"]}, decoded={s[\"next_token\"][0][\"n_decoded\"]}') for s in d]"
# 5. Verify VRAM saturation
cat /sys/class/drm/card*/device/mem_info_vram_used
# Total VRAM (512MB on Strix Halo):
cat /sys/class/drm/card*/device/mem_info_vram_total
The retry loop pattern: 60 proxy requests in ~6 minutes (5s intervals) to the stuck model, plus 30+ simultaneous requests to another model as model-manager tries to swap.
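The per-model count from step 3 can be turned into a simple threshold check. The sketch below reuses the same grep pattern as the diagnosis command and only adds a cutoff; flag_retry_loops is a hypothetical helper, and the default of 50 is taken from the "50-100+ = stuck" rule of thumb rather than from any measured limit.

```shell
# Hypothetical helper: reads router journal lines on stdin and prints
# "<count> <model>" for every model with at least $1 proxy requests
# (default 50), highest count first.
flag_retry_loops() {
    threshold=${1:-50}
    grep "proxy_reques" \
        | grep -o "model [a-z0-9-]*" \
        | sort | uniq -c \
        | awk -v t="$threshold" '$1 >= t { print $1, $3 }' \
        | sort -rn
}

# Usage (not run here):
#   journalctl --user -u m5-router.service --since "30 min ago" --no-pager \
#       | flag_retry_loops 50
```

Empty output means no model crossed the threshold in the chosen time window.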
Recovery:
# Kill the stuck backend process by alias
ps aux | grep llama-server | grep "alias <model-name>" | awk '{print $2}' | xargs kill -9
# Or restart the router to clear all backends
systemctl --user restart m5-router.service
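The recovery commands above go straight to kill -9. A gentler variant is sketched below, assuming the stuck backend still reacts to SIGTERM; if it does not, the helper falls back to SIGKILL after a grace period. terminate_gracefully is a hypothetical name, and the 10-second default simply mirrors the router's own force-kill timeout quoted in the symptom sequence.

```shell
# Hypothetical helper: send SIGTERM to a PID, wait up to N seconds
# (default 10) for it to exit, and only then send SIGKILL.
terminate_gracefully() {
    pid=$1
    grace=${2:-10}
    kill -TERM "$pid" 2>/dev/null || return 0    # process already gone
    waited=0
    while kill -0 "$pid" 2>/dev/null; do         # still alive?
        if [ "$waited" -ge "$grace" ]; then
            kill -KILL "$pid" 2>/dev/null        # grace period over
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Usage (not run here), with the PID taken from the ps pipeline above:
#   terminate_gracefully <pid> 10
```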
Prevention:
- Disable "Suggested Questions" in OpenWebUI: Settings → Interface → uncheck "Show suggested questions after chat response"
- The reasoning_budget and max_tokens injections in model-manager don't prevent this: the hang is from large-prompt prefill time, not generation length
- Follow-up question requests should ideally use streaming (not sync) to avoid blocking slots
- The specific sync-after-completion signature ("→ model sync" immediately after "◆ model reasoning:") is the hallmark of OpenWebUI follow-up generation
Reference: references/openwebui-followup-request-flood.md — full log signatures and timeline from an observed incident on step37.
Verification
- Check logs: journalctl --user -u m5-router.service -f
- Router should show all models in GUI on port 8080
- No duplicate processes on port 8080 (only router, not standalone llama-server)
- Ensure port 8079 is clear if not used