name: other-sub-agents
description: >
  Delegate development sub-tasks to local or cloud LLMs. Use this skill whenever the user asks to offload work to local models, mentions Ollama or Gemini, wants to use local AI or cloud AI for coding tasks, or when a development task could benefit from LLM sub-agents (code generation, debugging, code review, documentation, architecture analysis, test writing). Also use this when the user says things like "use my local models", "run this through Ollama", "delegate to local AI", "use Gemini", "use a cloud model", or asks about what models are available.
Other Sub Agents
This skill delegates tasks to other LLM sub-agents running either locally (via Ollama) or in the cloud (via Google Gemini API). It discovers what's available, assesses each model's strengths, and routes tasks to the most appropriate model and provider.
Why This Exists
Claude acts as the orchestrator: it decides what to delegate, picks the right model, crafts the prompt, and integrates the results. Some tasks — generating boilerplate, writing tests, reviewing code snippets, drafting docs — can be handled by other models, saving API costs and leveraging the user's local GPU or free cloud tiers.
Two providers are supported:
- Ollama — local models, fully private, no rate limits, works offline
- Gemini — Google's cloud models, up to 1M token context, free tier available
How It Works
Step 0: Seed Initial Trust Scores (First Run Only)
On first use, seed trust scores from published benchmark data:
python3 ~/.claude/skills/other-sub-agents/scripts/seed_trust.py
This populates ~/.claude/skills/other-sub-agents/data/trust_scores.json with baseline scores derived from HumanEval,
SWE-bench, MBPP, BigCodeBench, MMLU, and other published benchmarks for ~35 common
models. Seed scores have count=1, so actual assessments quickly override them.
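To see why a seed with count=1 is overridden quickly, here is a minimal sketch. It assumes the trust score is a count-weighted running mean of ratings normalized to the 0-1 scale. The update rule, the normalization, and the names `TrustEntry` and `log_rating` are illustrative assumptions; the real logic lives in trust_manager.py and may differ.

```python
from dataclasses import dataclass


@dataclass
class TrustEntry:
    """Hypothetical record for one (model, task) pair."""
    score: float  # trust score on the 0-1 scale
    count: int    # number of observations behind the score


def log_rating(entry: TrustEntry, rating: int) -> TrustEntry:
    """Fold a 1-5 rating into the running mean (assumed update rule)."""
    if not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    normalized = (rating - 1) / 4  # map 1..5 onto 0..1
    total = entry.score * entry.count + normalized
    return TrustEntry(score=total / (entry.count + 1), count=entry.count + 1)


seed = TrustEntry(score=0.80, count=1)  # benchmark-derived seed
after = seed
for rating in (2, 2, 3):                # three disappointing real runs
    after = log_rating(after, rating)
print(round(after.score, 3), after.count)
```

Under these assumptions, three real assessments already pull an optimistic seed of 0.80 down to 0.45, because the seed carries the weight of a single observation.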
Step 1: Discover Available Models
Before delegating, check what's available across both providers:
python3 ~/.claude/skills/other-sub-agents/scripts/discover_models.py
This checks Ollama at localhost:11434 and (if GEMINI_API_KEY is set) the Gemini API.
It returns a unified report of all models with their capabilities and routing recommendations.
Options:
- --provider all|ollama|gemini: which providers to check (default: all)
- --json: structured JSON output
- --task code|review|docs|architecture|test|debug|general: recommend the best model for a task
- --seed-new: automatically seed trust scores for any newly discovered models
If a provider is unavailable (Ollama not running, no Gemini key), discovery continues gracefully with the other provider.
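The graceful-degradation behavior can be sketched as follows. This is not the code of discover_models.py; `discover` and the probe callables are hypothetical names. The only external call shown is Ollama's `/api/tags` endpoint, which lists locally installed models. The demo at the bottom uses stub probes so it runs without a network.

```python
import json
import urllib.request
from typing import Callable, Dict, List


def ollama_models(base_url: str = "http://localhost:11434") -> List[str]:
    """List local models via Ollama's /api/tags endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]


def discover(probes: Dict[str, Callable[[], List[str]]]) -> Dict[str, dict]:
    """Run every probe; a failing provider is reported, not fatal."""
    report: Dict[str, dict] = {}
    for provider, probe in probes.items():
        try:
            report[provider] = {"available": True, "models": probe()}
        except Exception as exc:  # connection refused, missing key, bad JSON
            report[provider] = {"available": False, "models": [], "error": str(exc)}
    return report


def gemini_stub() -> List[str]:
    """Stand-in for a Gemini probe when no key is configured."""
    raise RuntimeError("GEMINI_API_KEY is not set")


# Stub probes keep the demo offline; swap in ollama_models for a real check.
report = discover({"ollama": lambda: ["qwen2.5-coder:32b"], "gemini": gemini_stub})
print(report)
```

Catching the failure per provider is what lets the report still list Ollama models when the Gemini probe raises.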
When a new model appears (one not in ~/.claude/skills/other-sub-agents/data/trust_scores.json):
Option A: Automatic seeding (recommended)
Run discovery with --seed-new to auto-research and seed new models:
python3 ~/.claude/skills/other-sub-agents/scripts/discover_models.py --seed-new
Or use the dedicated research script directly:
# Auto-discover and seed all new models
python3 ~/.claude/skills/other-sub-agents/scripts/research_new_model.py --auto-discover
# Seed a specific new model
python3 ~/.claude/skills/other-sub-agents/scripts/research_new_model.py --model "new-model:14b" --provider ollama
# Preview what would be seeded (dry run)
python3 ~/.claude/skills/other-sub-agents/scripts/research_new_model.py --auto-discover --dry-run
The auto-research script estimates scores by:
- Looking for exact matches in the benchmark seed database
- Interpolating from similar models (same family, type, and size)
- Falling back to heuristic estimation based on model type, size, and family
Estimated scores are stored with count=1, so real assessments quickly override them.
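The three estimation strategies listed above can be sketched as a fallback chain. Everything here is illustrative: the `SEED_DB` entries, the model names, the linear interpolation on parameter count, and the heuristic constants are assumptions, not the contents of research_new_model.py.

```python
from typing import Dict, Optional, Tuple

# Hypothetical seed data: model name -> code-task score on the 0-1 scale.
SEED_DB: Dict[str, float] = {
    "examplecoder:7b": 0.55,
    "examplecoder:32b": 0.75,
}


def split_name(model: str) -> Tuple[str, Optional[float]]:
    """Split 'family:14b' into ('family', 14.0); size is None if absent."""
    family, _, tag = model.partition(":")
    try:
        return family, float(tag.lower().rstrip("b"))
    except ValueError:
        return family, None


def estimate(model: str) -> Tuple[float, str]:
    """Return (score, method): exact match, then interpolation, then heuristic."""
    if model in SEED_DB:
        return SEED_DB[model], "exact"
    family, size = split_name(model)
    siblings = sorted(
        (size_b, score)
        for name, score in SEED_DB.items()
        for fam, size_b in [split_name(name)]
        if fam == family and size_b is not None
    )
    if size is not None and len(siblings) >= 2 and siblings[0][0] != siblings[-1][0]:
        (lo_size, lo), (hi_size, hi) = siblings[0], siblings[-1]
        # Linear interpolation on size, clamped to the known range.
        t = min(max((size - lo_size) / (hi_size - lo_size), 0.0), 1.0)
        return lo + t * (hi - lo), "interpolated"
    # Last resort: coarse guess from the family name alone.
    return (0.5 if "coder" in family else 0.4), "heuristic"


for name in ("examplecoder:32b", "examplecoder:14b", "unknown:3b"):
    print(name, estimate(name))
```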
Option B: Manual research (for higher confidence)
- Search the web for its published benchmarks (HumanEval, MBPP, SWE-bench, MMLU, etc.)
- Use those benchmarks to estimate initial trust scores per task type (0-1 scale)
- Log seed scores via trust_manager.py with count=1 so real assessments override them quickly
- If no benchmarks exist, start with the static classification and use comparison mode to build trust data against a known model
Step 1.5: Check Trust Scores
After discovery, check historical trust scores to inform model selection:
python3 ~/.claude/skills/other-sub-agents/scripts/trust_manager.py --action rankings --task code --json
This shows all models ranked by their observed trust score for a given task type. Trust scores are built from your assessments after each sub-agent interaction.
For blended recommendations (static capabilities + trust data):
python3 ~/.claude/skills/other-sub-agents/scripts/discover_models.py --task code --with-trust --json
Trust score thresholds:
- > 0.7 (70%): Use confidently — model has proven itself on this task type
- 0.4–0.7: Use but review output carefully
- < 0.4: Consider an alternative or use comparison mode
- No data: Use static recommendation, then assess to build trust data
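The thresholds above translate directly into a small decision helper. The function name `trust_action` and the returned strings are hypothetical; only the cut-offs come from the list above. A score of exactly 0.7 is treated as part of the 0.4-0.7 band.

```python
from typing import Optional


def trust_action(score: Optional[float]) -> str:
    """Map a 0-1 trust score (or None for no data) to a recommended action."""
    if score is None:
        return "use static recommendation, then assess"
    if score > 0.7:
        return "use confidently"
    if score >= 0.4:
        return "use, but review output carefully"
    return "consider an alternative or comparison mode"


for value in (0.85, 0.55, 0.2, None):
    print(value, "->", trust_action(value))
```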
Step 2: Assess Model Fitness for the Task
Read ~/.claude/skills/other-sub-agents/references/model-profiles.md for detailed capability profiles.
Provider selection (Ollama first, Gemini as escalation):
| Scenario | Provider | Why |
|---|---|---|
| Private or sensitive code | Ollama | Data stays local |
| Large context needed (>32K tokens) | Gemini | 1M token window |
| No local GPU or weak hardware | Gemini | Cloud-hosted |
| Batch of similar tasks | Ollama | No rate limits |
| No suitable local model for the task | Gemini | Quality escalation |
| User explicitly requests a provider | Whichever they ask for | Respect preference |
| Neither provider available | Do it yourself in Claude | Fallback |
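One way to read the table as code is shown below. The `TaskContext` fields and the precedence between rows (explicit user preference first, then privacy, then context size and model fit) are assumptions made for illustration; the table itself does not define an order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskContext:
    """Hypothetical description of one delegation request."""
    requested_provider: Optional[str] = None
    sensitive: bool = False
    context_tokens: int = 0
    ollama_available: bool = True
    gemini_available: bool = True
    local_model_suitable: bool = True


def choose_provider(ctx: TaskContext) -> str:
    """Return 'ollama', 'gemini', or 'claude' (do it yourself)."""
    if not ctx.ollama_available and not ctx.gemini_available:
        return "claude"
    if ctx.requested_provider in ("ollama", "gemini"):
        return ctx.requested_provider  # respect the user's preference
    if ctx.sensitive:
        # Sensitive code never leaves the machine.
        return "ollama" if ctx.ollama_available else "claude"
    needs_cloud = ctx.context_tokens > 32_000 or not ctx.local_model_suitable
    if needs_cloud or not ctx.ollama_available:
        return "gemini" if ctx.gemini_available else "claude"
    return "ollama"  # Ollama first, Gemini as escalation


print(choose_provider(TaskContext(context_tokens=100_000)))
print(choose_provider(TaskContext(sensitive=True, context_tokens=100_000)))
```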
Model type routing:
Code-focused models (Ollama: names with coder, codellama, starcoder;
Gemini: any model) are best for:
- Writing functions, classes, or modules from specs
- Debugging and fixing code
- Code review, unit tests, refactoring, boilerplate
Large reasoning models (Ollama: 70B+ general models, gpt-oss;
Gemini: gemini-2.5-pro, gemini-3-pro-preview) are best for:
- Architecture and design discussions
- Documentation writing, planning, requirements analysis
- Explaining complex code
If no model is a strong match, don't delegate — do it yourself.
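For Ollama models, the name-based routing above could look like this. The function `classify_ollama_model` and the "general" fallback label are hypothetical; the name hints and the 70B threshold come from the text, and parsing the size out of the tag is an assumption about naming conventions.

```python
import re

CODE_HINTS = ("coder", "codellama", "starcoder")


def classify_ollama_model(name: str) -> str:
    """Classify an Ollama model name as 'code', 'reasoning', or 'general'."""
    lowered = name.lower()
    if any(hint in lowered for hint in CODE_HINTS):
        return "code"
    if "gpt-oss" in lowered:
        return "reasoning"
    # Look for a parameter count such as '70b' or '7b' in the name or tag.
    match = re.search(r"(\d+(?:\.\d+)?)b\b", lowered)
    if match and float(match.group(1)) >= 70:
        return "reasoning"
    return "general"


for model in ("qwen2.5-coder:32b", "llama3.3:70b", "mistral:7b"):
    print(model, "->", classify_ollama_model(model))
```

A "general" result with no strong match is the case where the advice above applies: do the task yourself rather than delegate.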
Step 3: Delegate the Task
Ollama example (local):
python3 ~/.claude/skills/other-sub-agents/scripts/agent_runner.py \
--model "qwen2.5-coder:32b" \
--task code \
--prompt "Write a function that validates email addresses using regex." \
--temperature 0.3
Gemini example (cloud):
python3 ~/.claude/skills/other-sub-agents/scripts/agent_runner.py \
--provider gemini \
--model "gemini-2.5-flash" \
--task code \
--prompt "Write a function that validates email addresses using regex." \
--temperature 0.3
With file context (works with both providers):
python3 ~/.claude/skills/other-sub-agents/scripts/agent_runner.py \
--provider gemini \
--model "gemini-2.5-pro" \
--task review \
--prompt "Review this code for bugs, performance, and style issues:" \
--file "/path/to/code.py" \
--temperature 0.4
Key flags:
- --provider ollama|gemini: which provider (default: ollama)
- --model: model name (e.g. qwen2.5-coder:32b or gemini-2.5-flash)
- --task: task type; sets the default system prompt and temperature
- --file: include a file as context
- --stats: print generation stats to stderr
- --timeout 300: increase for large models
- --max-tokens 8192: max output tokens (Gemini only)
- --no-stream: wait for the full response instead of streaming
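When calling the runner from Python rather than a shell, building the argument list explicitly avoids quoting problems in prompts. The helper `build_command` is a hypothetical convenience wrapper; it uses only flags documented above.

```python
import os
from typing import List, Optional

RUNNER = os.path.expanduser(
    "~/.claude/skills/other-sub-agents/scripts/agent_runner.py"
)


def build_command(
    model: str,
    task: str,
    prompt: str,
    provider: str = "ollama",
    file: Optional[str] = None,
    timeout: Optional[int] = None,
    stream: bool = True,
) -> List[str]:
    """Assemble an argv list for agent_runner.py from the documented flags."""
    cmd = [
        "python3", RUNNER,
        "--provider", provider,
        "--model", model,
        "--task", task,
        "--prompt", prompt,
    ]
    if file:
        cmd += ["--file", file]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    if not stream:
        cmd.append("--no-stream")
    return cmd


cmd = build_command(
    "gemini-2.5-pro", "review", "Review this code for bugs:",
    provider="gemini", file="/path/to/code.py", timeout=300, stream=False,
)
print(cmd)
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`; because it is a list and not a shell string, quotes inside the prompt need no escaping.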
Step 3.5: Assess the Sub-Agent's Output
After reviewing the sub-agent response, always log an assessment to build trust data:
python3 ~/.claude/skills/other-sub-agents/scripts/trust_manager.py --action log \
--model "qwen2.5-coder:32b" --provider ollama \
--task code --rating 4 \
--prompt-summary "Write email validation function"
Or combine execution with assessment in one command:
python3 ~/.claude/skills/other-sub-agents/scripts/agent_runner.py \
--model "qwen2.5-coder:32b" --task code \
--prompt "Write a function that validates email addresses." \
--assess --rating 4
Rating scale (1–5):
- 5 (Excellent): Directly usable, no or trivial edits needed
- 4 (Good): Mostly correct, minor fixes needed
- 3 (Adequate): Partially useful, needs significant editing
- 2 (Poor): Mostly wrong, major misunderstandings
- 1 (Failed): Unusable — wrong language, ignored instructions, errored out
Omit --rating to use auto-assessment (heuristic based on response length, code
blocks, and structure). Always prefer explicit ratings when you can judge quality.
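To show why explicit ratings are preferable, here is a rough sketch of what a surface-level heuristic can and cannot see. It is not the skill's auto-assessment: the function `auto_rating`, its thresholds, and its scoring are invented for illustration, and it covers only response length and code blocks.

```python
FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable


def auto_rating(response: str, task: str) -> int:
    """Guess a 1-5 rating from surface features (illustrative thresholds)."""
    text = response.strip()
    if not text:
        return 1  # empty output counts as a failure
    rating = 3
    has_code = text.count(FENCE) >= 2  # at least one complete fenced block
    if task in ("code", "test", "debug"):
        rating += 1 if has_code else -1
    if len(text) < 40:
        rating -= 1  # suspiciously short answer
    return max(1, min(5, rating))


sample = FENCE + "python\ndef f():\n    return 1\n" + FENCE + "\nThis function returns one."
print(auto_rating(sample, "code"))
```

A heuristic like this rewards any fenced block, correct or not, which is why a human or orchestrator judgment should take precedence whenever one is available.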
Step 4: Integrate Results
After getting a response from a sub-agent:
- Review the output — always sanity-check before applying
- Apply selectively — use the good parts, fix or discard the rest
- Tell the user what was delegated, which model handled it, and which provider
Prompt Engineering for Sub-Agent Models
These models respond best to:
- Clear, specific instructions — be explicit about what you want
- Structured output requests — "Return only the function, no explanation"
- Concrete examples — show the format you expect
- Limited scope — one task per request, not multi-step workflows
- System prompts: the --task flag sets good defaults
Avoid:
- Vague instructions ("make this better")
- Multi-part tasks in a single prompt
- Assuming the model knows about the broader codebase
- Sending more context than needed (especially to Ollama models with limited windows)
Error Handling
Ollama errors:
- Not running: Inform user, suggest ollama serve
- Model not found: Show available models, suggest ollama pull <model>
- Timeout: Default 120s. For large models use --timeout 300
- Out of memory: Suggest a smaller or more quantized variant
Gemini errors:
- No API key: Inform user, direct to https://aistudio.google.com/apikey
- Invalid key (401/403): Ask user to verify key
- Rate limit (429): Free tier is 10 RPM (Flash), 5 RPM (Pro). Wait or switch to Ollama
- Model not found (404): Check model name spelling
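The Gemini cases above amount to a lookup from HTTP status to next action. A minimal sketch follows; `gemini_error_advice` is a hypothetical helper, and the messages paraphrase the list above.

```python
from typing import Optional


def gemini_error_advice(status: Optional[int], has_key: bool = True) -> str:
    """Translate a Gemini failure into the action recommended above."""
    if not has_key:
        return "No API key: create one at https://aistudio.google.com/apikey"
    if status in (401, 403):
        return "Invalid key: ask the user to verify the key"
    if status == 429:
        return "Rate limited: wait or switch to Ollama"
    if status == 404:
        return "Model not found: check the model name spelling"
    return f"Unexpected error (status {status}): report it to the user"


for code in (401, 429, 404, 500):
    print(code, "->", gemini_error_advice(code))
```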
Comparison Mode (Dual-Model)
When unsure which model is better for a task, or to build trust data faster, run two models on the same task simultaneously:
python3 ~/.claude/skills/other-sub-agents/scripts/agent_runner.py \
--model "qwen2.5-coder:32b" --provider ollama \
--model-b "gemini-2.5-flash" --provider-b gemini \
--task code \
--prompt "Write a function that validates email addresses." \
--compare --no-stream
This outputs both responses as structured JSON. Review both outputs, pick the best result, and assess both models:
# Rate the winner
python3 ~/.claude/skills/other-sub-agents/scripts/trust_manager.py --action log \
--model "qwen2.5-coder:32b" --provider ollama \
--task code --rating 4 --prompt-summary "email validation"
# Rate the other
python3 ~/.claude/skills/other-sub-agents/scripts/trust_manager.py --action log \
--model "gemini-2.5-flash" --provider gemini \
--task code --rating 2 --prompt-summary "email validation"
When to use comparison mode:
- A new model has been added and you need trust data quickly
- Two models seem similarly capable for a task type
- The user wants to see alternative approaches
- Trust scores are close (within 10%) for the top two candidates
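The last criterion can be checked mechanically from the rankings output. This sketch reads "within 10%" as an absolute gap of 0.10 on the 0-1 scale, which is an interpretation; `should_compare` and the input shape (model name to score) are hypothetical.

```python
from typing import Dict


def should_compare(rankings: Dict[str, float], margin: float = 0.10) -> bool:
    """True when the top two trust scores are within `margin` of each other."""
    if len(rankings) < 2:
        return False
    first, second = sorted(rankings.values(), reverse=True)[:2]
    return first - second <= margin


print(should_compare({"qwen2.5-coder:32b": 0.82, "gemini-2.5-flash": 0.75}))
print(should_compare({"qwen2.5-coder:32b": 0.90, "gemini-2.5-flash": 0.60}))
```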
Execution strategy:
- Mixed providers (Ollama + Gemini): Runs in parallel — no contention
- Same provider (Ollama + Ollama): Runs sequentially — avoids VRAM swapping
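The execution strategy can be sketched with a thread pool. The `Job` tuple and `run_pair` are hypothetical names, and the runners here are stubs standing in for real model calls. The text only specifies the Ollama + Ollama case; treating any same-provider pair as sequential is a conservative assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# (provider, model, zero-argument callable that returns the response text)
Job = Tuple[str, str, Callable[[], str]]


def run_pair(job_a: Job, job_b: Job) -> List[str]:
    """Run two jobs in parallel unless both target the same provider."""
    if job_a[0] == job_b[0]:
        # Same provider: run one after the other (avoids VRAM swapping on Ollama).
        return [job_a[2](), job_b[2]()]
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(job[2]) for job in (job_a, job_b)]
        return [future.result() for future in futures]


mixed = run_pair(
    ("ollama", "qwen2.5-coder:32b", lambda: "local answer"),
    ("gemini", "gemini-2.5-flash", lambda: "cloud answer"),
)
print(mixed)
```

Results come back in the order the jobs were passed in, regardless of which one finishes first.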
Comparison results are saved in ~/.claude/skills/other-sub-agents/data/comparisons/ for reference.
Important Constraints
- Never send sensitive data (API keys, credentials, secrets) to any sub-agent model — Ollama logs to disk, Gemini sends data to Google
- Never log or expose GEMINI_API_KEY: read it from the environment variable only, never print it
- Don't over-delegate: if crafting the prompt takes longer than doing the task, just do it
- Always review output — treat sub-agent output as a draft, never as final
- Respect context limits — Ollama models typically have 2K-32K; Gemini supports up to 1M
- One Ollama model at a time — switching models incurs a VRAM load penalty. Batch similar tasks for the same model. Gemini has no such constraint.
- Gemini rate limits — 5-15 RPM on free tier. Don't rapid-fire requests.
- Gemini sends data to Google — do not send code containing secrets, credentials, or proprietary information that shouldn't leave the machine
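A pre-flight check can catch the most obvious violations of the first constraint before a prompt or file is sent. This is a rough sketch, not a substitute for reviewing what you send: `looks_sensitive` and its two patterns are illustrative assumptions and will miss many secret formats while flagging some harmless text.

```python
import os
import re

SECRET_PATTERNS = [
    # PEM-style private key header
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    # Assignments such as API_KEY=..., password: ..., token = ...
    re.compile(r"(api[_-]?key|secret|password|token)\w*\s*[:=]\s*\S+", re.IGNORECASE),
]


def looks_sensitive(text: str) -> bool:
    """Return True if the text appears to contain credentials."""
    key = os.environ.get("GEMINI_API_KEY")
    if key and key in text:
        return True  # never forward the key itself
    return any(pattern.search(text) for pattern in SECRET_PATTERNS)


print(looks_sensitive("password = hunter2"))
print(looks_sensitive("def add(a, b): return a + b"))
```

Erring toward false positives is deliberate in this sketch: a flagged prompt costs a moment of review, while a leaked credential cannot be recalled.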