name: local-llm-bridge plugin: cli-agents description: > Local Gemma 4 12B sub-agent. Routes bounded tasks directly to the optimized llama-server at localhost:8089 — no routing proxy, no cloud API, 2–5s typical response. Use for fast, private, cost-free subtask delegation from any cloud primary agent. Part of the run_agent.py multi-LLM task router — cli=llama target. allowed-tools: Bash, Read, Write
Identity: The Local Gemma Sub-Agent Dispatcher
Dispatches bounded tasks directly to the optimized local Gemma 4 12B server at http://localhost:8089/v1/chat/completions. No routing proxy involved. Uses the run_agent.py task router with cli=llama.
[!IMPORTANT] Requires llama-server running on port 8089. Check:
curl http://localhost:8089/healthStart:./run_server.shin the local-llm-bench workspace. Thinking is disabled server-side (--reasoning off) — no special flags needed.
Why This Is Fast
The routing proxy (Mode A) carries ~29K tokens of Claude Code system prompt — at ~30 tok/s prefill that costs 60+ seconds per context boundary crossing.
This skill (Mode B) sends only the task prompt — typically 50–500 tokens. At 7+ tok/s generation on M1 Metal with a small context:
| Output length | Typical response time |
|---|---|
| 50 tokens | ~7s |
| 100 tokens | ~14s |
| 200 tokens | ~28s |
Default max_tokens=120 keeps responses terse. Override via code if needed.
Orchestration Pattern: run_agent.py
python ./scripts/run_agent.py \
<PERSONA_FILE> <INPUT_FILE> <OUTPUT_FILE> "<INSTRUCTION>" \
--cli llama --max-tokens 120
Example — code review
python ./scripts/run_agent.py \
agents/refactor-expert.md \
target.py \
review.md \
"List the top 3 issues in this code. Be terse." \
--cli llama
Example — summarize a diff (longer output)
python ./scripts/run_agent.py \
/dev/null \
changes.diff \
summary.md \
"Summarize this diff in 2 sentences. Focus on risk." \
--cli llama --max-tokens 200
Example — instruction only (no input file)
python ./scripts/run_agent.py \
/dev/null /dev/null \
answer.md \
"What is the capital of France? One word." \
--cli llama --max-tokens 10
Prompt Budget Guidelines
Keep prompts lean — this is the primary performance lever:
- Persona: 100–300 tokens (enough to set role and tone)
- Source file: keep under 2,000 tokens where possible; trim to the relevant section
- Instruction: 1–3 sentences; specific and bounded
- Expected output: terse — list form, not prose paragraphs
Avoid: pasting full file trees, long conversation histories, or open-ended "analyze everything" instructions.
Hardware Details (M1 Mac, 16GB)
| Parameter | Value |
|---|---|
| Server | llama-server :8089 |
| Model | Gemma 4 12B UD-Q4_K_XL |
| GPU offload | -ngl 99 (full Metal) |
| Flash Attention | -fa on |
| Batch sizes | -b 2048 -ub 512 |
| KV cache quant | -ctk q8_0 -ctv q8_0 |
| Thinking | disabled (--reasoning off) |
| Context | 32768 tokens (1 slot) |
Persona Registry (agents/)
| Persona | Use For |
|---|---|
security-auditor.md |
Vulnerability review, risk assessment |
refactor-expert.md |
Code cleanup, readability, DRY |
architect-review.md |
Design review, modularity check |
Co-located Scripts (scripts/)
All scripts are symlinked from the canonical plugins/cli-agents/scripts/ so the skill is self-contained when installed in isolation.
| Script | Purpose |
|---|---|
run_agent.py |
Task router — cli=llama dispatches here |
kv_cache_orchestrator.py |
KV slot save/restore for repeated persona calls |
run_server.py |
Start llama-server with authoritative parameters |
test_run_agent.py |
37 tests: command builders, isolated-flag security contract, llama payload |
routing_proxy.pyis NOT included — it is the Mode A API compatibility shim and is not part of this skill's execution path.
Smoke Test
curl http://localhost:8089/health
python ./scripts/run_agent.py /dev/null /dev/null /tmp/test.md "Say hello in one word." --cli llama
cat /tmp/test.md
Health Check
curl http://localhost:8089/health # must return {"status":"ok"}
# If down: python ./scripts/run_server.py