name: agent-harness-engineering description: Build, audit, or improve agentic coding/research systems and their harnesses. Use when designing agent workflows, small-model scaffolds, tool schemas, progressive-disclosure context, repository knowledge maps, guardrails, validation loops, subagents, evals, recovery mechanisms, or autonomous-agent safety policies.
Agent Harness Engineering
Use this skill to design the environment around an agent: context surfaces, tools, guardrails, validation, recovery, and feedback loops. The default move is not "write a bigger prompt"; it is "make the right action legible and mechanically checkable."
Workflow
Define the job and risk.
- Goal: coding, research, operations, UI validation, data work, review, or cleanup.
- Model class: small/local, frontier, reasoning, tool-weak, or no-tool.
- Autonomy: advisory, edit-with-confirmation, autonomous in sandbox, or PR producer.
- Failure cost: low, medium, high, irreversible, or data-exfiltration risk.
Build a context map before loading details.
- Keep the always-loaded entry point short: a table of contents, not a manual.
- Point to repo-local sources of truth:
README.org,docs/,AGENTS.md, skills, blueprints, eval fixtures, generated schemas, and runbooks. - Preserve an existing canonical source of truth. Do not create mirrored docs unless there is an enforcement/update path.
- Load details only when the current step needs them.
Shape the tool surface.
- Expose the fewest tools that can complete the current step.
- Prefer deterministic tools and structured outputs.
- Use compound tools when a small model repeatedly needs the same chain
(
search_and_read,read_and_edit,write_and_test). - Add repair hints to tool failures: why it failed, what to try next, and an exact minimal example.
Add mechanical feedback loops.
- Run tests, linters, type checks, UI probes, log queries, or metrics checks as tools or hooks.
- Treat failures as prompt surface: concise, actionable, and scoped.
- Encode recurring review comments as docs, lints, checks, or skills.
Add safety and state controls.
- Use read-before-write, sandboxed filesystem policy, DLP, irreversible action checks, and confirmation only for high-signal cases.
- Keep session-local state for plans, files read, cached pure tool results, evidence, and model/tool failures.
- For multi-file edits, prefer checkpoints or rollbackable patches.
Evaluate and iterate.
- Add a small eval before claiming a harness improvement works.
- Compare profiles such as baseline, limited-tools, compound-tools, strict-write, selective-skills, and quality-monitor.
- Track pass rate, tool-call count, recovery count, validation failures, and context size.
Small-Model Biases
For small or local models, optimize for fewer choices and shorter turns:
- Inject one relevant instruction block, not all instructions.
- Show fewer tool schemas; route by task phase when possible.
- Keep the active plan visible and current.
- Prefer exact file ranges and diffs over full files.
- Cache repeated read/search results.
- Detect loops: repeated tool call, repeated edit miss, empty answer, malformed tool args, or repeated shell failure.
- On repair turns, give the model the failing command, the smallest relevant output tail, and the next action.
Repository Harness Surfaces
Use this minimal checklist when auditing a project:
- Entry map: short
AGENTS.mdor agent index with links to deeper docs. - Knowledge: one canonical source or well-indexed docs; large Org sources
should have a structure-first query tool such as
oq. - Tools: read/search/edit/shell split by role; MCP tools classified by risk.
- Validation: fast test command, lint/type command, formatting command.
- Guardrails: sandbox, secret handling, read-before-write, destructive checks.
- Feedback: evals, traces, review agents, quality sweeps, stale-doc cleanup.
- Legibility: logs, UI/browser state, screenshots, metrics, traces, schemas.
Run the bundled audit for a quick first pass:
python3 <skill-dir>/scripts/harness_audit.py /path/to/repo
Progressive References
Read only the reference needed for the current task:
references/principles.md: principles and tradeoffs for agent harness design.references/patterns.md: reusable implementation patterns and when to use them.references/examples.md: concrete examples for coding agents, research agents, UI agents, and small-model profiles.
Output Shape
When designing or auditing a harness, return:
Goal:
Model and autonomy assumptions:
Current harness inventory:
Recommended changes:
1. Highest-leverage change
2. Next change
3. Later change
Validation plan:
Risks and guardrails:
Keep recommendations mechanical. Prefer "add this check/tool/state machine" over "tell the model to be careful."