name: local-model-route
description: >
  Routes any task to the correct local Ollama model based on task type,
  context size, and privacy requirements. Prevents GPU overload on 12GB VRAM systems.
ug-ug: lite
model: sonnet
gstack_stage: Think
SKILL: local-model-route
Bot: any
Role: Given a task description and constraints, select the correct local Ollama model and emit a ready-to-use routing decision with ollama run or API call pattern.
Ug-ug mode: lite
Status: beta
Parallelizable: yes — no shared mutable state detected (auto-inferred; verify)
Model: sonnet
Tool compatibility: Claude Code · OpenCode · Aider · n8n HTTP node
When to invoke
Trigger phrases: "which model should I use?", "route this task", "what ollama model for X", "local model for [task]", "offload to ollama"
Invoke before any local Ollama session when the right model isn't obvious.
Routing table
| Task type | Model | Size | Notes |
|---|---|---|---|
| Tab autocomplete | qwen2.5-coder:3b | 1.9GB | Always-on, fastest |
| Boilerplate / light code transforms | qwen2.5-coder:3b | 1.9GB | Low latency |
| Code edit, refactor, focused build | qwen2.5-coder:7b | 4.7GB | Main coding sweet spot |
| PRD draft, PM summary, scoped estimate | qwen3:8b | 5.2GB | Best general local model |
| Estimate sanity check, spec critique | deepseek-r1:8b | 4.9GB | Slower; use for review only |
| 14B edge sessions (short, focused) | qwen2.5-coder:14b | 9.0GB | Edge of comfort; close other models first |
| Embeddings (Mem0, Qdrant, repo search) | nomic-embed-text | 274MB | Always-on alongside gen model |
| Higher-quality embeddings | qwen3-embedding:8b | 4.7GB | Run separately from large gen jobs |
Hard limits on 12GB VRAM:
- Never run 14B + any other gen model simultaneously
- Default context: 4k (Ollama default under 24GB VRAM). Raise to 16k only when OpenCode tool use or PRD work needs it:
  OLLAMA_CONTEXT_LENGTH=16384 ollama serve
- 30B+ models: not viable locally
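The hard limits above can be approximated with a small pre-load check. This is a minimal sketch, not part of the skill itself: the sizes are copied from the routing table, while `can_load`, `HEADROOM_GB`, and the 1.5GB headroom figure are hypothetical names and values chosen for illustration. On-disk size understates real VRAM use, so treat the result as a rough guide only.

```python
# Approximate model sizes (GB) copied from the routing table. Real VRAM use is
# higher once the KV cache and runtime overhead are included.
MODEL_SIZE_GB = {
    "qwen2.5-coder:3b": 1.9,
    "qwen2.5-coder:7b": 4.7,
    "qwen3:8b": 5.2,
    "deepseek-r1:8b": 4.9,
    "qwen2.5-coder:14b": 9.0,
    "nomic-embed-text": 0.274,
    "qwen3-embedding:8b": 4.7,
}
EMBEDDING_MODELS = {"nomic-embed-text", "qwen3-embedding:8b"}
VRAM_BUDGET_GB = 12.0
HEADROOM_GB = 1.5  # assumption: allowance for KV cache and overhead


def can_load(candidate: str, loaded: list[str]) -> tuple[bool, str]:
    """Return (ok, reason) for loading `candidate` next to `loaded` models."""
    everything = loaded + [candidate]
    gen_models = [m for m in everything if m not in EMBEDDING_MODELS]
    # Hard limit: a 14B model never shares the GPU with another gen model.
    if any(m.endswith(":14b") for m in gen_models) and len(gen_models) > 1:
        return False, "14B must not share the GPU with another gen model"
    total = sum(MODEL_SIZE_GB[m] for m in everything)
    if total + HEADROOM_GB > VRAM_BUDGET_GB:
        return False, f"estimated {total:.1f}GB plus headroom exceeds budget"
    return True, f"estimated {total:.1f}GB fits"
```

For example, `can_load("qwen3:8b", ["qwen2.5-coder:14b"])` is rejected by the 14B rule, while `can_load("qwen2.5-coder:7b", ["qwen2.5-coder:3b", "nomic-embed-text"])` passes.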
Step-by-step execution
Step 1 — Classify the task
task_type: autocomplete | code-edit | prd-summary | estimate-review | embeddings
context_needed: small (<4k) | medium (4k–16k) | large (>16k)
privacy: local-only | cloud-ok
concurrent_jobs: list any other models currently loaded
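The classification fields above could be held in a small record. The sketch below assumes "4k" means 4,096 tokens and "16k" means 16,384 tokens; `TaskSpec` and `context_bucket` are hypothetical names introduced here, not part of the skill.

```python
from dataclasses import dataclass, field

SMALL_LIMIT = 4_096    # assumed token value for "4k"
MEDIUM_LIMIT = 16_384  # assumed token value for "16k"


def context_bucket(tokens: int) -> str:
    """Map an estimated token count onto small / medium / large."""
    if tokens < SMALL_LIMIT:
        return "small"
    if tokens <= MEDIUM_LIMIT:
        return "medium"
    return "large"


@dataclass
class TaskSpec:
    task_type: str        # autocomplete | code-edit | prd-summary | estimate-review | embeddings
    context_needed: str   # small | medium | large
    privacy: str          # local-only | cloud-ok
    concurrent_jobs: list[str] = field(default_factory=list)


# Example: a PRD summary of roughly 9,000 tokens with the embedder already loaded.
spec = TaskSpec("prd-summary", context_bucket(9_000), "local-only", ["nomic-embed-text"])
```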
Step 2 — Apply routing logic
IF privacy == local-only → always route local regardless of task
IF context_needed == large AND task_type != embeddings → warn: may need cloud exception path
IF concurrent_jobs includes 14b model → do not add another gen model
ELSE → pick from routing table above
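One way to read the rules above is sketched below. It rests on several assumptions: the rules are checked independently rather than as a strict if/else chain, the 14B block does not apply to embedding jobs, and each Step 1 task type maps to one row of the routing table. `ROUTES` and `route` are hypothetical names.

```python
# Assumed mapping from Step 1 task types to routing table rows.
ROUTES = {
    "autocomplete": "qwen2.5-coder:3b",
    "code-edit": "qwen2.5-coder:7b",
    "prd-summary": "qwen3:8b",
    "estimate-review": "deepseek-r1:8b",
    "embeddings": "nomic-embed-text",
}


def route(task_type: str, context_needed: str, privacy: str,
          concurrent_jobs: list[str]) -> dict:
    """Return the chosen model (or None if blocked) plus any warnings."""
    if task_type not in ROUTES:
        raise ValueError(f"unknown task_type: {task_type}")
    warnings = []
    if context_needed == "large" and task_type != "embeddings":
        if privacy == "local-only":
            # Privacy wins: stay local even though the context is large.
            warnings.append("large context but local-only: stay local")
        else:
            warnings.append("large context: may need cloud exception path")
    model = ROUTES[task_type]
    if task_type != "embeddings" and any(":14b" in m for m in concurrent_jobs):
        # A loaded 14B model blocks any additional gen model.
        model = None
        warnings.append("14b model loaded: do not add another gen model")
    return {"model": model, "warnings": warnings}
```

Returning warnings alongside the model, rather than raising, keeps the final call with the person running the session.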
Step 3 — Emit decision
MODEL: qwen3:8b
CONTEXT: 16384 (raise only if needed)
PATTERN: ollama run qwen3:8b
OR: curl http://localhost:11434/api/generate -d '{"model":"qwen3:8b","keep_alive":"30m"}'
REASON: PRD draft — general reasoning, no long context needed
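The decision block above could be produced by a small formatter. This is a sketch under the assumption that the five labelled lines are the whole output; `emit_decision` is a hypothetical helper, and the URL and `keep_alive` value are taken from the example above.

```python
import json


def emit_decision(model: str, context: int, reason: str,
                  keep_alive: str = "30m") -> str:
    """Format a routing decision in the MODEL/CONTEXT/PATTERN/OR/REASON layout."""
    # Compact separators reproduce the tight JSON used in the curl example.
    payload = json.dumps({"model": model, "keep_alive": keep_alive},
                         separators=(",", ":"))
    return "\n".join([
        f"MODEL: {model}",
        f"CONTEXT: {context}",
        f"PATTERN: ollama run {model}",
        f"OR: curl http://localhost:11434/api/generate -d '{payload}'",
        f"REASON: {reason}",
    ])


print(emit_decision("qwen3:8b", 16384, "PRD draft, general reasoning"))
```

A request to `/api/generate` that carries only `model` and `keep_alive` loads the model without generating text, so the curl line works as a warm-up call before the session starts.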
Cloud exception path
Only use when:
- Context clearly exceeds 16k–32k effective local capacity
- Multi-repo architectural reasoning required
- Deliverable has zero margin for error
Never send: raw secrets, broad repo snapshots, PII, client-confidential code.
Send: Repomix-compressed slices, redacted excerpts, issue-specific context only.
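The gate for the cloud exception path might be sketched as follows. It assumes that any one of the three conditions is enough to qualify and that a `local-only` privacy setting always blocks the exception; the text above does not say whether the conditions combine with AND or OR. `cloud_exception_allowed` and its parameters are hypothetical.

```python
def cloud_exception_allowed(privacy: str, exceeds_local_context: bool,
                            multi_repo: bool, zero_margin: bool) -> bool:
    """Return True only when a cloud exception may be considered."""
    if privacy == "local-only":
        # Privacy requirement overrides every other consideration.
        return False
    # Assumption: any single qualifying condition is sufficient.
    return exceeds_local_context or multi_repo or zero_margin
```

A `True` result only opens the path; the payload still has to be limited to compressed slices, redacted excerpts, and issue-specific context.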
Key rules
- Autocomplete model stays loaded all day — 3B never competes with main model
- nomic-embed-text stays loaded alongside any gen model (only 274MB)
- deepseek-r1:8b is for review passes only — not interactive sessions
- Never recommend cloud for proprietary client code (PrevenDebt, WebMeet, etc.)
Handoffs
| Next step | Where |
|---|---|
| Terminal agent loop | skills/opencode-aider/SKILL.md |
| Repo context packaging | skills/repomix-pack/SKILL.md |
| Memory system bootstrap | skills/mem0-qdrant/SKILL.md |
Permissions
| Type | Pattern | Why |
|---|---|---|
| Filesystem | <workspace>/... | Referenced in skill body |
| Network | http://localhost:11434/* | Referenced in skill body |