name: slm-routing
version: 1.0.0
description: >
  Declarative decision rubric for routing AI work between a small local
  language model (SLM) and a remote frontier LLM. Closes Gap G3 of the
  OS3PD manifesto v4.13.0 (Principle 5: Minimize Campfire Impacts).
  This skill is pure markdown with no runtime code. It documents the
  task-triviality and token-budget thresholds at which the cheaper local
  model is the better tool, and the failure modes that mandate routing
  remotely. Sister of response-compression/SKILL.md along a different
  axis: that skill controls what is said (verbosity); this one
  controls where it is sent (compute target).
protocols:
- OS3PD-P5
agnostic: [os, project]
SLM Routing
Routing a task between a small local language model (SLM, typically 1B–8B parameters) and a remote frontier LLM (typically 70B+ or proprietary) is the FinOps and Green-IT axis of context optimization. It is complementary to verbosity compression, which is documented separately in skills/response-compression/SKILL.md.
When to Use
- Before invoking any LLM for a discrete sub-task, ask: "is this task in the SLM-eligible band?" If yes, prefer the local model.
- When token-budget-gate.sh raises RULE-009 (Token Bloat) on a sub-agent spawn, this skill provides the routing answer to the question "could a cheaper model have produced this output adequately?"
- When the operator explicitly asks for "cheap mode", "local model only", or "no remote calls".
Decision Matrix
The matrix below maps an objective signal (task type + input/output token budget) to a routing recommendation. The bands are intentionally conservative — when in doubt, prefer remote.
| Task class | Input tokens | Output tokens | Route to | Rationale |
|---|---|---|---|---|
| Regex / glob construction | < 512 | < 256 | SLM | Pattern synthesis is well-represented in small-model training; the consequence of a mistake is a syntactic error caught immediately by the regex engine. |
| Single-file lint / format | < 2k | < 1k | SLM | Local style rules; the file fits in a small context window; failures are caught by the linter itself. |
| Deterministic refactor (rename, parameter extraction, type-annotation insertion) | < 4k | < 2k | SLM | The diff is mechanical; LSP tooling catches semantic errors. |
| Inline code comment for a small function | < 1k | < 256 | SLM | The comment is descriptive, not normative. |
| Commit-message draft for a < 500-line diff | < 2k | < 512 | SLM | Templates dominate the output distribution; humans review before push anyway. |
| Multi-file architectural change | any | any | remote | Cross-file reasoning saturates small models quickly; the blast radius justifies the cost. |
| Code review for security-relevant code | any | any | remote | False-negative on a security flaw is materially worse than the marginal LLM cost. |
| API design / contract authoring | any | any | remote | Long-horizon consequences; remote reasoning is worth its cost. |
| Bug investigation requiring stack traces + multiple files | any | any | remote | Joint reasoning across log + code is where frontier models still dominate. |
| Any task with input > 32k tokens | > 32k | any | remote | Most SLMs cap usable context around 32k–128k with quality degradation past 8k–16k. |
| Anything where the answer must be cited from an external source | any | any | remote | Tool-use + retrieval coordination is fragile on small models. |
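The SLM rows of the matrix can be read as a lookup table with a remote default. The sketch below is one possible encoding, not part of the skill (which ships no runtime code): the task-class keys, the `Route` type, and the `route` function are hypothetical names, and reading "2k" as 2048 tokens is an assumption.

```python
from dataclasses import dataclass

# Hypothetical task-class keys; the skill itself defines no identifiers.
# Values are (input limit, output limit), both exclusive, copied from the
# SLM rows of the matrix. Reading "2k" as 2048 is an assumption.
SLM_BANDS = {
    "regex_glob": (512, 256),
    "single_file_lint": (2048, 1024),
    "deterministic_refactor": (4096, 2048),
    "inline_comment": (1024, 256),
    "commit_message": (2048, 512),
}


@dataclass(frozen=True)
class Route:
    target: str  # "slm" or "remote"
    reason: str


def route(task_class: str, input_tokens: int, output_tokens: int) -> Route:
    """Return a routing recommendation; anything not listed goes remote."""
    band = SLM_BANDS.get(task_class)
    if band is None:
        # Conservative default: when in doubt, prefer remote.
        return Route("remote", f"no SLM band for task class {task_class!r}")
    max_in, max_out = band
    if input_tokens < max_in and output_tokens < max_out:
        return Route("slm", f"within SLM band for {task_class!r}")
    return Route("remote", f"token budget exceeds SLM band for {task_class!r}")


if __name__ == "__main__":
    print(route("regex_glob", 200, 100))
    print(route("security_review", 200, 100))
```

Unknown task classes fall through to remote, which matches the matrix's stated bias toward remote when in doubt.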
Real Local SLMs (verify availability before invoking)
The following models are referenced as examples of locally-runnable SLMs that are commonly available through standard inference servers (Ollama, llama.cpp, vLLM, Hugging Face Text Generation Inference). The set is non-exhaustive and the field moves quickly — always verify against the host's installed model catalog before routing.
| Model family | Typical sizes | Strengths | Authoritative source |
|---|---|---|---|
| Phi (Microsoft) | 3.8B (phi-3-mini, phi-3.5-mini-instruct) | General reasoning, coding | https://huggingface.co/microsoft/Phi-3-mini-4k-instruct |
| Gemma (Google) | 2B (gemma-2-2b-it), 9B | Multilingual; safety-tuned | https://huggingface.co/google/gemma-2-2b-it |
| Qwen-Coder (Alibaba) | 1.5B / 3B / 7B (Qwen2.5-Coder-*-Instruct) | Code synthesis; multi-language code understanding | https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct |
| Llama 3.2 (Meta) | 1B / 3B (Llama-3.2-3B-Instruct) | Conversation, instruction-following | https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct |
| DeepSeek-Coder | 1.3B / 6.7B (deepseek-coder-6.7b-instruct) | Code completion; lower-resource workstations | https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct |
| StarCoder2 | 3B / 7B / 15B | Code synthesis; the 15B model was trained on 600+ programming languages, the 3B and 7B models on 17 | https://huggingface.co/bigcode/starcoder2-7b |
For empirical benchmarks against frontier models, consult the HumanEval, MMLU, and MT-Bench leaderboards published by the respective model authors and by independent evaluation suites such as lmsys/chatbot-arena and eqbench.
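Checking the host's installed model catalog before routing might look like the sketch below. It assumes the catalog has already been fetched and parsed into a dict shaped like `{"models": [{"name": ...}]}`, which is modeled on the response of Ollama's list-models endpoint; treat that shape, the sample tags, and the function names as illustrative assumptions and check them against the inference server actually in use.

```python
from typing import List, Optional, Set

# Assumed catalog shape, modeled on an Ollama-style model listing.
# The tags are illustrative and not a statement about any real host.
SAMPLE_CATALOG = {
    "models": [
        {"name": "qwen2.5-coder:7b"},
        {"name": "llama3.2:3b"},
    ]
}


def installed_models(catalog: dict) -> Set[str]:
    """Collect the model names present in a parsed catalog payload."""
    return {
        entry["name"]
        for entry in catalog.get("models", [])
        if "name" in entry
    }


def first_available(preferred: List[str], catalog: dict) -> Optional[str]:
    """Return the first preferred model that is installed, else None."""
    installed = installed_models(catalog)
    for name in preferred:
        if name in installed:
            return name
    return None


if __name__ == "__main__":
    choice = first_available(["phi3:mini", "llama3.2:3b"], SAMPLE_CATALOG)
    # A None result means no local candidate exists, so the task goes remote.
    print(choice if choice is not None else "route remote")
```

Returning `None` rather than raising keeps the "no local model installed" case an ordinary routing outcome instead of an error path.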
Failure Modes That Force Remote Routing
Once any of the following is true, route remotely regardless of token budget:
- Cross-file reasoning required. The model must reconcile constraints across 3+ files.
- Adversarial input handling. Security-sensitive parsing or auth boundary work.
- Multilingual + reasoning interaction. SLMs degrade more steeply on multilingual reasoning than on either dimension alone.
- Tool use orchestration. Multi-step tool-use chains (search → read → edit → verify) are still unreliable on most SLMs.
- Hard correctness budget. When the cost of a single wrong answer exceeds the budget saved by routing locally — e.g., production code review.
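The five conditions above act as overrides that are checked before any token-budget comparison. A minimal sketch of that check follows, assuming the caller can supply each signal; the `ForceRemote` flag names, the `detect` parameters, and the reading of "multi-step" as more than one tool step are all hypothetical choices made for illustration.

```python
from enum import Flag, auto


class ForceRemote(Flag):
    """Hypothetical flags, one per failure mode in the list above."""
    NONE = 0
    CROSS_FILE = auto()
    ADVERSARIAL_INPUT = auto()
    MULTILINGUAL_REASONING = auto()
    TOOL_ORCHESTRATION = auto()
    HARD_CORRECTNESS = auto()


def detect(
    files_to_reconcile: int,
    security_sensitive: bool,
    multilingual_reasoning: bool,
    tool_steps: int,
    wrong_answer_cost: float,
    local_savings: float,
) -> ForceRemote:
    """Collect every failure mode that applies to a task."""
    flags = ForceRemote.NONE
    if files_to_reconcile >= 3:  # "3+ files"
        flags |= ForceRemote.CROSS_FILE
    if security_sensitive:
        flags |= ForceRemote.ADVERSARIAL_INPUT
    if multilingual_reasoning:
        flags |= ForceRemote.MULTILINGUAL_REASONING
    if tool_steps > 1:  # assumption: "multi-step" means more than one step
        flags |= ForceRemote.TOOL_ORCHESTRATION
    if wrong_answer_cost > local_savings:
        flags |= ForceRemote.HARD_CORRECTNESS
    return flags


def must_route_remote(flags: ForceRemote) -> bool:
    """Any single flag is enough, regardless of token budget."""
    return flags != ForceRemote.NONE


if __name__ == "__main__":
    flags = detect(3, False, False, 1, 0.0, 1.0)
    print(flags, must_route_remote(flags))
```

Keeping the flags separate from the final boolean lets the caller record which failure mode triggered the remote route.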
Anti-Patterns
- Routing by model name vibe. Don't pick a model because it has "coder" in its name — pick based on task class, input size, and required correctness.
- Always-remote. Defeats the purpose. If the SLM-eligible band fits the task, use it.
- Always-local on a small workstation. A model whose weights and KV cache do not fit in RAM (for example, an unquantized 7B model on a laptop with 8 GB) will swap, and the wall-clock cost dwarfs the per-token savings.
- Routing by token count alone. A 200-token security analysis is not SLM-eligible; a 3k-token deterministic refactor is.
- Silent fallback chains. If an SLM returns a low-confidence answer, escalate explicitly — do not let the agent silently retry remotely with no audit trail.
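The last anti-pattern, silent fallback, can be avoided by writing an audit record whenever a task is escalated. The sketch below assumes the local call returns a `(text, confidence)` pair; most SLM runtimes do not expose a confidence score directly, so that score, the 0.7 threshold, and every name here are hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple

LocalCall = Callable[[str], Tuple[str, float]]  # returns (text, confidence)
RemoteCall = Callable[[str], str]


def answer_with_escalation(
    prompt: str,
    local_call: LocalCall,
    remote_call: RemoteCall,
    audit_log: List[Dict],
    min_confidence: float = 0.7,  # hypothetical threshold
) -> str:
    """Try the local model first; record every escalation explicitly."""
    text, confidence = local_call(prompt)
    if confidence >= min_confidence:
        audit_log.append({
            "event": "local_accepted",
            "confidence": confidence,
            "ts": time.time(),
        })
        return text
    # The escalation is logged before the remote call so that the record
    # exists even if the remote call later fails.
    audit_log.append({
        "event": "escalated_to_remote",
        "confidence": confidence,
        "threshold": min_confidence,
        "ts": time.time(),
    })
    return remote_call(prompt)


if __name__ == "__main__":
    log: List[Dict] = []
    result = answer_with_escalation(
        "draft a commit message",
        local_call=lambda p: ("unsure", 0.2),   # stub local model
        remote_call=lambda p: "remote answer",  # stub remote model
        audit_log=log,
    )
    print(result, [entry["event"] for entry in log])
```

In practice the audit log would be a persistent sink rather than an in-memory list; the list keeps the sketch self-contained.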
Sister Skills and Cross-References
| Concern | Where to look |
|---|---|
| Why this skill exists | protocols/os3pd-manifesto.md §P5 (Gap G3) |
| Verbosity axis (complementary) | skills/response-compression/SKILL.md |
| Token-bloat detection (deterministic side) | plugin-scripts/governance/token-budget-gate.sh (RULE-009) |
| Co-validation with PII scanning | skills/pii-masking/SKILL.md (P6, Gap G1; routing must NOT bypass the PII firewall) |
| FAIR / explainability axis | skills/ttl-policy/SKILL.md + protocols/hierarchical-merge-protocol.md |
What this Skill is NOT
- ❌ A model registry. It does not list every SLM; the field changes monthly.
- ❌ A runtime router. The decision is made by the operator or the agent at invocation time — no hook intercepts and re-routes silently.
- ❌ A cost calculator. The matrix is qualitative; integrate with the host's FinOps tooling for hard numbers.
- ❌ A replacement for human judgement. The matrix is a starting point; promote tasks to remote whenever the cost of being wrong dominates the cost of remote tokens.