description: Improve test design, speed, and coverage with behavior-focused tests, useful seams, characterization tests, TDD, and test refactoring. Use when improving tests, optimizing slow suites, adding coverage, refactoring brittle tests, removing test waste, or working test-first. NOT for fixing production bugs (use fixing-code), production-code refactors (use refactoring-code), or reviewing non-test code quality (use reviewing-code).
name: improving-tests
Test Improvement
Improve tests so they catch real behavior regressions without blocking safe code changes. Suite latency is a quality attribute. Coverage is a signal, not the goal.
Role-gated action
Detect capability from tools:
- Write-capable role: inspect tests, apply changes, and run verification.
- Read-only role: inspect supplied files/output and emit changes in the Proposed Changes contract. Apply nothing; run nothing.
- Missing key tool or permission: stop with Blocked and ask for the exact artifact, access, or approval needed.
Use an interactive question tool when available for mode selection, missing scope, missing framework approval, or unsafe test-stack choices.
Route elsewhere
Do not use this for:
- production bug fixes → fixing-code
- production-code refactors → refactoring-code
- non-test code review → reviewing-code
- new feature implementation unless the user asked for TDD
- browser-only UI investigation without a test-improvement goal → browser-automation
References
Detect languages from files in scope and read only the matching reference:
- C# → references/csharp.md
- Go → references/go.md; for slow-suite or feedback-loop work, also read references/go-performance.md
- Java/Kotlin → references/java-kotlin.md
- Python → references/python.md; for slow-suite or feedback-loop work, also read references/python-performance.md
- TypeScript → references/typescript.md; for slow-suite or feedback-loop work, also read references/typescript-performance.md
- Rust → references/rust.md
- Web → references/web.md
Use generic rules only for unsupported languages.
Modes
- review: find weak, duplicate, brittle, missing, slow, or flaky tests
- refactor: simplify tests without changing covered behavior
- coverage: add tests for uncovered business behavior or error paths
- tdd: one red-green-refactor slice at a time
- performance: measure test latency and remove speed waste without weakening behavior
- full: review, refactor, performance, and add coverage
If mode is missing, ask one question with these options.
Choose the seam
Test through the contract that users or adjacent modules rely on:
- Public module, package, API, CLI, component, or service boundary.
- Integration seam when behavior depends on real wiring: database, filesystem, HTTP, queue, cache, framework routing, serialization, or config.
- Unit seam when behavior is pure, local, deterministic, and cheap to exercise.
Use graph tools only when available and when they help choose the seam or risk:
- GitNexus: use query/context to find flows around a behavior; use impact or detect-changes to choose regression tests for changed symbols and affected processes.
- codegraph: check freshness first; if fresh, use affected/context to find callers, high fan-in surfaces, and modules that need regression coverage.
- Stale graph indexes are not evidence. Refresh if allowed; otherwise report the gap and use search, coverage output, and source reads.
Test rules
- Test behavior, not private helpers, call counts, or layout.
- Treat slow feedback as test waste. Fast tests make agents verify more often.
- Remove waits, real I/O, redundant setup, expensive collection, unbalanced parallelism, and broad default commands before reducing checks.
- Keep coverage, race, mutation, live, browser, and end-to-end modes off the hot path unless the task is about that mode.
- Mock only system boundaries: network, clock, randomness, filesystem, subprocesses, external services.
- Prefer real collaborators or in-memory fakes for internal domain code.
- Cover success, failure, edge, boundary, and regression cases that matter.
- Use coverage to find gaps; do not write low-value assertions just to raise a number.
- Delete shallow or duplicate tests once stronger public-boundary tests cover the behavior.
- Extract helpers only after repeated setup or assertions make tests harder to read.
- Follow project conventions before introducing new frameworks, helpers, or generators.
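As a sketch of the boundary-only mocking rule, the example below injects the clock and leaves everything else real. `Session` and `FakeClock` are hypothetical names made up for illustration; the test asserts observable behavior at the expiry boundary, not call counts.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable


class Session:
    """Hypothetical domain object; the clock is its only injected boundary."""

    def __init__(self, ttl: timedelta, now: Callable[[], datetime]):
        self._now = now
        self._expires_at = now() + ttl

    def is_expired(self) -> bool:
        return self._now() >= self._expires_at


class FakeClock:
    """In-memory stand-in for the system clock; tests advance it explicitly."""

    def __init__(self, start: datetime):
        self.current = start

    def __call__(self) -> datetime:
        return self.current

    def advance(self, delta: timedelta) -> None:
        self.current += delta


def test_session_expires_after_ttl():
    clock = FakeClock(datetime(2024, 1, 1, tzinfo=timezone.utc))
    session = Session(ttl=timedelta(minutes=30), now=clock)

    assert not session.is_expired()  # fresh session is valid
    clock.advance(timedelta(minutes=29, seconds=59))
    assert not session.is_expired()  # just before the boundary
    clock.advance(timedelta(seconds=1))
    assert session.is_expired()      # exactly at the boundary
```

No sleep is needed and the test is deterministic, so it can stay on the hot path.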
Feedback-loop performance
For performance mode or any slow-suite work:
- Measure the baseline command and wall time.
- Classify the bottleneck: discovery, import/compile/transform, setup, test body, external boundary, runner config, or parallel balance.
- Make one cluster of changes that removes waste without weakening behavior.
- Rerun the same command and record before/after time.
- Add or recommend a guard: durations output, per-test ceiling, slow marker, cache, or focused command.
Prefer focused deterministic checks during edits and the broader relevant suite before final output. Do not hide failures, delete edge cases, or skip important fast tests to make a number look better.
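The measure-and-rerun steps can be sketched with the standard library alone. `time_command` and `summarize` are hypothetical helpers, and the command shown is a placeholder; substitute the project's real baseline command and keep it identical for the before and after runs.

```python
import subprocess
import sys
import time


def time_command(argv: list[str], runs: int = 3) -> list[float]:
    """Run the same command several times and return wall times in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=False: a failing suite is still timed; judge pass/fail separately.
        subprocess.run(argv, check=False, capture_output=True)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(label: str, timings: list[float]) -> str:
    """Format best and worst wall time for a before/after record."""
    return (
        f"{label}: best {min(timings):.2f}s, "
        f"worst {max(timings):.2f}s over {len(timings)} runs"
    )


if __name__ == "__main__":
    # Placeholder command (assumption); replace with the suite's baseline command.
    baseline = time_command([sys.executable, "-c", "pass"])
    print(summarize("baseline", baseline))
```

Several runs are taken because a single wall-time sample can be skewed by cold caches or machine load.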
TDD and characterization
TDD:
- Name one behavior at the public seam.
- Write one failing test that fails for the expected reason.
- Implement the smallest passing code.
- Refactor only while green.
- Repeat one behavior at a time.
Characterization tests:
- Use before risky changes to legacy or under-specified code.
- Capture current externally visible behavior, including quirks.
- Place tests at the public boundary first; add narrower tests only when they add diagnostic value.
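A minimal characterization sketch, assuming a hypothetical legacy function `legacy_format_price`. The expected values record what the code does today, quirks included, not what it should do.

```python
def legacy_format_price(amount):
    """Hypothetical legacy code about to be changed."""
    if amount is None:
        return "N/A"
    return "$" + str(round(amount, 2))


# Observed current behavior, pinned as-is before any risky change.
CURRENT_BEHAVIOR = [
    (5, "$5"),          # quirk: ints render without decimals
    (5.0, "$5.0"),      # quirk: one decimal place, not two
    (2.675, "$2.67"),   # quirk: binary float rounding, not 2.68
    (-3.5, "$-3.5"),    # quirk: sign after the currency symbol
    (None, "N/A"),
]


def test_characterize_legacy_format_price():
    for amount, expected in CURRENT_BEHAVIOR:
        assert legacy_format_price(amount) == expected, amount
```

If a later change is meant to fix a quirk, update that one row deliberately so the behavior change is visible in the diff.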
Review checks
Look for:
- tests coupled to private helpers, internals, or incidental call order
- mocks hiding real behavior or contracts
- duplicate scenario matrices that should be parameterized when readability stays high
- missing business, error, edge, concurrency, or permission cases
- flaky tests from time, randomness, ordering, shared state, or real external services
- slow tests that could move down a seam without losing confidence
- real sleeps, real external I/O, coverage-on-default, expensive imports, broad discovery, repeated setup, or slow transforms in the fast path
- order dependencies, leaked globals, or shared mutable resources that block safe parallelism
- dead tests that cover deleted behavior or generated glue
Preferred consolidation:
- Go: table-driven tests with subtests.
- Java/Kotlin: parameterized JUnit/Kotest cases when they stay readable.
- Rust: small case tables or focused #[test] functions.
- Python: parametrized pytest cases.
- TypeScript: it.each or equivalent project pattern.
Do not force consolidation when separate tests make distinct behavior clearer.
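As an illustration of consolidating a duplicate scenario matrix, the sketch below uses stdlib `unittest` subtests so it runs without extra dependencies; in a pytest project the same case tables would feed `pytest.mark.parametrize`. `parse_port` is a hypothetical function under test.

```python
import unittest


def parse_port(value: str) -> int:
    """Hypothetical function under test: parse a TCP port or raise ValueError."""
    port = int(value)  # raises ValueError for non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


class ParsePortTest(unittest.TestCase):
    def test_valid_ports(self):
        # Boundaries and a typical value share one readable table.
        cases = [("1", 1), ("80", 80), ("65535", 65535)]
        for raw, expected in cases:
            with self.subTest(raw=raw):
                self.assertEqual(parse_port(raw), expected)

    def test_rejected_inputs(self):
        # Kept separate from the valid cases: rejection is distinct behavior.
        for raw in ["0", "65536", "-1", "http", ""]:
            with self.subTest(raw=raw):
                with self.assertRaises(ValueError):
                    parse_port(raw)
```

Accepted and rejected inputs stay in separate tests because they describe different behavior; only the repeated rows within each are folded into a table.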
Verification
Run the relevant project command after changes. Examples:
dotnet test path/to/Tests.csproj
dotnet test path/to/App.sln
go test ./pkg/name -run TestName
go test ./...
./gradlew :module:test --tests '*FooTest'
./mvnw -q -Dtest=FooTest test
cargo test -p crate_name test_name
cargo test --all-targets
pytest -q --maxfail=1 --tb=short
pytest -q --durations=10 --durations-min=0.5
vitest run path/to/file.test.ts
bun test
Use coverage commands only when coverage mode or review needs them. Report skipped checks with exact reasons.
Output
Engineer:
TEST IMPROVEMENT COMPLETE
=========================
Mode: review | refactor | coverage | tdd | performance | full
Tests changed: N
Waste removed: N
Coverage: before → after | not measured
Performance: baseline → after | not measured
Status: CLEAN | NEEDS ATTENTION
Key improvements:
- path:line — change
Verification:
- <command> — pass/fail/skipped with reason
Reviewer or blocked:
## Proposed Changes | BLOCKED
Blocker:
- <missing artifact, framework, tool, permission, or safe seam>
### Change 1: <brief description>
File: `path/to/test_file`
Action: CREATE | MODIFY | DELETE
Code: <complete test code or changed region with enough context>
Rationale: <weak, missing, brittle, slow, or duplicate test this fixes>
Verification: <command the applier should run>
If no test framework exists, ask before adding one. Do not claim clean without a passing check or an explicit skipped-check reason.