description: Improve test design and coverage with behavior-focused tests, useful seams, characterization tests, TDD, and test refactoring. Use when improving tests, adding coverage, refactoring brittle tests, removing test waste, or working test-first. NOT for fixing production bugs (use fixing-code), production-code refactors (use refactoring-code), or reviewing non-test code quality (use reviewing-code).
name: improving-tests
Test Improvement
Improve tests so they catch real behavior regressions without blocking safe code changes. Coverage is a signal, not the goal.
Role-gated action
Detect capability from tools:
- Write-capable role: inspect tests, apply changes, and run verification.
- Read-only role: inspect supplied files/output and emit changes in the Proposed Changes contract. Apply nothing; run nothing.
- Missing key tool or permission: stop with Blocked and ask for the exact artifact, access, or approval needed.
Use an interactive question tool when available for mode selection, missing scope, missing framework approval, or unsafe test-stack choices.
Route elsewhere
Do not use this for:
- production bug fixes → fixing-code
- production-code refactors → refactoring-code
- non-test code review → reviewing-code
- new feature implementation unless the user asked for TDD
- browser-only UI investigation without a test-improvement goal → browser-automation
References
Detect languages from files in scope and read only the matching reference:
- Go → references/go.md
- Python → references/python.md
- TypeScript → references/typescript.md
- Web → references/web.md
Use generic rules only for unsupported languages.
Modes
- review: find weak, duplicate, brittle, missing, slow, or flaky tests
- refactor: simplify tests without changing covered behavior
- coverage: add tests for uncovered business behavior or error paths
- tdd: one red-green-refactor slice at a time
- full: review, refactor, and add coverage
If mode is missing, ask one question with these options.
Choose the seam
Test through the contract that users or adjacent modules rely on:
- Public module, package, API, CLI, component, or service boundary.
- Integration seam when behavior depends on real wiring: database, filesystem, HTTP, queue, cache, framework routing, serialization, or config.
- Unit seam when behavior is pure, local, deterministic, and cheap to exercise.
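A minimal sketch of an integration seam, assuming a hypothetical `load_config` entry point whose behavior depends on real file reads and JSON parsing. The function name and the `retries` default are illustrative and not part of any project; the point is that the test uses a real temporary file instead of mocking the filesystem.

```python
import json
import tempfile
from pathlib import Path


def load_config(path: Path) -> dict:
    """Hypothetical public entry point: read a JSON config and apply defaults."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return {"retries": 3, **data}


def test_load_config_applies_defaults_and_overrides():
    # Real filesystem and real JSON parsing: the wiring is part of the behavior.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "config.json"

        path.write_text('{"timeout": 5}', encoding="utf-8")
        assert load_config(path) == {"retries": 3, "timeout": 5}

        # An explicit value in the file wins over the default.
        path.write_text('{"retries": 1}', encoding="utf-8")
        assert load_config(path) == {"retries": 1}
```

If the defaulting logic grew complex enough to be worth testing on its own, it could move to a pure function and be covered at a unit seam.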
Use graph tools only when available and when they help choose the seam or assess risk:
- GitNexus: use query/context to find flows around a behavior; use impact or detect-changes to choose regression tests for changed symbols and affected processes.
- codegraph: check freshness first; if fresh, use affected/context to find callers, high fan-in surfaces, and modules that need regression coverage.
- Stale graph indexes are not evidence. Refresh if allowed; otherwise report the gap and use search, coverage output, and source reads.
Test rules
- Test behavior, not private helpers, call counts, or layout.
- Mock only system boundaries: network, clock, randomness, filesystem, subprocesses, external services.
- Prefer real collaborators or in-memory fakes for internal domain code.
- Cover success, failure, edge, boundary, and regression cases that matter.
- Use coverage to find gaps; do not write low-value assertions just to raise a number.
- Delete shallow or duplicate tests once stronger public-boundary tests cover the behavior.
- Extract helpers only after repeated setup or assertions make tests harder to read.
- Follow project conventions before introducing new frameworks, helpers, or generators.
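A minimal sketch of several of these rules together, assuming a hypothetical `SessionService` with an injected clock and an in-memory fake store. All class and method names here are illustrative. Only the clock, a system boundary, is replaced; the store is a working fake, and the assertions check observable behavior, not call counts.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, Optional


@dataclass
class InMemorySessionStore:
    """In-memory fake standing in for a real session store (hypothetical)."""
    _expiry: Dict[str, datetime] = field(default_factory=dict)

    def save(self, token: str, expires_at: datetime) -> None:
        self._expiry[token] = expires_at

    def expires_at(self, token: str) -> Optional[datetime]:
        return self._expiry.get(token)


class SessionService:
    """Hypothetical domain service; the clock is the only injected boundary."""

    def __init__(self, store, now: Callable[[], datetime], ttl: timedelta):
        self._store = store
        self._now = now
        self._ttl = ttl

    def login(self, token: str) -> None:
        self._store.save(token, self._now() + self._ttl)

    def is_active(self, token: str) -> bool:
        expires_at = self._store.expires_at(token)
        return expires_at is not None and self._now() < expires_at


def test_session_expires_after_ttl():
    current = [datetime(2024, 1, 1, tzinfo=timezone.utc)]
    service = SessionService(
        InMemorySessionStore(), lambda: current[0], timedelta(minutes=30)
    )

    service.login("abc")
    assert service.is_active("abc")          # success case

    current[0] += timedelta(minutes=30)      # boundary: exactly at expiry
    assert not service.is_active("abc")
    assert not service.is_active("unknown")  # failure case: never logged in
```

Because the test never inspects how the store was called, the service can change its storage strategy without breaking the test.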
TDD and characterization
TDD:
- Name one behavior at the public seam.
- Write one failing test that fails for the expected reason.
- Implement the smallest passing code.
- Refactor only while green.
- Repeat one behavior at a time.
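One TDD slice could look like the sketch below, assuming a hypothetical `slugify` function as the public seam. The comments mark what was written in each phase; the function and its behavior are illustrative only.

```python
# Red: this test is written first. With a stub `slugify` that returned the
# title unchanged, it failed on the value comparison, which is the expected
# reason (not an import error or a typo in the test).
def test_slugify_joins_lowercased_words_with_hyphens():
    assert slugify("Hello World") == "hello-world"


# Green: the smallest implementation that makes the one test pass.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())


# Refactor: only while the test stays green. The next slice (for example,
# punctuation handling) starts with its own failing test.
```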
Characterization tests:
- Use before risky changes to legacy or under-specified code.
- Capture current externally visible behavior, including quirks.
- Place tests at the public boundary first; add narrower tests only when they add diagnostic value.
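A characterization test can be sketched as follows, assuming a hypothetical `legacy_format_price` function standing in for under-specified legacy code. The expected strings record what the code does today, quirks included; they are not a statement of what it should do.

```python
def legacy_format_price(amount):
    """Stand-in for legacy code about to be changed (hypothetical)."""
    return "$" + str(round(amount, 2))


def test_characterize_legacy_format_price():
    # Values below were captured from the current implementation.
    assert legacy_format_price(3) == "$3"            # quirk: no decimals for ints
    assert legacy_format_price(2.5) == "$2.5"        # quirk: no trailing zero
    assert legacy_format_price(-1.239) == "$-1.24"   # quirk: sign after the symbol
```

Once the risky change lands, any assertion that pins an unwanted quirk can be updated deliberately, in its own commit, so the behavior change is visible.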
Review checks
Look for:
- tests coupled to private helpers, internals, or incidental call order
- mocks hiding real behavior or contracts
- duplicate scenario matrices that should be parameterized when readability stays high
- missing business, error, edge, concurrency, or permission cases
- flaky tests from time, randomness, ordering, shared state, or real external services
- slow tests that could move down a seam without losing confidence
- dead tests that cover deleted behavior or generated glue
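As an illustration of the flaky-test check, here is a minimal sketch that removes randomness as a source of flakiness by injecting the random generator. The `pick_reviewers` function and the team names are hypothetical. The test asserts reproducibility and basic properties instead of a specific seeded result.

```python
import random


def pick_reviewers(candidates, count, rng):
    """Hypothetical function under test; `rng` is injected, not global."""
    return sorted(rng.sample(candidates, count))


def test_pick_reviewers_is_reproducible_with_a_seeded_rng():
    team = ["ana", "bo", "cy", "di"]

    first = pick_reviewers(team, 2, random.Random(7))
    second = pick_reviewers(team, 2, random.Random(7))

    assert first == second              # same seed, same picks
    assert len(set(first)) == 2         # no duplicate reviewers
    assert set(first) <= set(team)      # picks come from the candidates
```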
Preferred consolidation:
- Go: table-driven tests with subtests.
- Python: parametrized pytest cases.
- TypeScript: it.each or equivalent project pattern.
Do not force consolidation when separate tests make distinct behavior clearer.
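A sketch of consolidating a duplicated scenario matrix, assuming a hypothetical `parse_port` function. To stay runnable with only the standard library, it uses `unittest`'s `subTest` as a stand-in; in a pytest project the same case tables would go into `@pytest.mark.parametrize`. Accepted and rejected inputs stay in separate tests because they describe distinct behavior.

```python
import unittest


def parse_port(value: str) -> int:
    """Hypothetical function under test."""
    port = int(value)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


class ParsePortTest(unittest.TestCase):
    def test_accepts_ports_in_range(self):
        cases = [("1", 1), ("8080", 8080), ("65535", 65535)]
        for raw, expected in cases:
            with self.subTest(raw=raw):
                self.assertEqual(parse_port(raw), expected)

    def test_rejects_out_of_range_or_non_numeric(self):
        for raw in ["0", "65536", "-1", "http"]:
            with self.subTest(raw=raw):
                with self.assertRaises(ValueError):
                    parse_port(raw)
```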
Verification
Run the relevant project command after changes. Examples:
go test ./...
pytest -v
bun test
Use coverage commands only when coverage mode or review needs them. Report skipped checks with exact reasons.
Output
Engineer (write-capable role):
TEST IMPROVEMENT COMPLETE
=========================
Mode: review | refactor | coverage | tdd | full
Tests changed: N
Waste removed: N
Coverage: before → after | not measured
Status: CLEAN | NEEDS ATTENTION
Key improvements:
- path:line — change
Verification:
- <command> — pass/fail/skipped with reason
Reviewer (read-only role) or blocked:
## Proposed Changes | BLOCKED
Blocker:
- <missing artifact, framework, tool, permission, or safe seam>
### Change 1: <brief description>
File: `path/to/test_file`
Action: CREATE | MODIFY | DELETE
Code: <complete test code or changed region with enough context>
Rationale: <weak, missing, brittle, slow, or duplicate test this fixes>
Verification: <command the applier should run>
If no test framework exists, ask before adding one. Do not claim clean without a passing check or an explicit skipped-check reason.