name: mutation-testing
description: Validate test suite quality by running a real mutation testing tool and triaging surviving mutants. Use after writing tests to verify assertions catch behavioral changes, when evaluating test coverage quality, or as a CI quality gate on critical modules. The AI value here is triage (classifying survivors, writing fix tests), not generating or estimating mutations.
role: worker
user-invocable: true
Mutation Testing
Wraps a real mutation tool (Stryker, pitest, mutmut, Stryker.NET) and adds AI triage of survivors. The tool generates mutations and reports survivors; the AI classifies survivors and writes fix tests. Never estimate or guess mutation outcomes — if no tool is available, help set one up; do not substitute reasoning for execution.
Constraints
- Always ask the user before running. Present the time estimate and scope; get explicit approval. Mutation testing can be slow — never surprise the user.
- Only run after tests exist; mutation testing validates tests, it does not replace them.
- Do not chase a 100% mutation score; equivalent mutants cannot be killed by any test, so they are noise rather than gaps.
- Scope to changed files by default; full-codebase runs are periodic audits.
- Surviving mutants in critical paths require action; in trivial code they may be acceptable.
- Per-mutant wall-clock timeout. Every mutant run is capped at a wall-clock timeout, and a run that exceeds it counts as a killed mutant (matching the experiment-harness fix). Default: `timeout_seconds = max(60, suite_time_seconds × 10)`. Never run without a timeout; an infinite-loop mutant will hang the harness indefinitely.
- `--workflow-managed-approval` carve-out. When this flag is set, the `Step 0` confirmation prompt is skipped, but the "always ask the user before running" invariant still holds at a higher boundary. The flag is reserved for orchestrated workflows that capture operator approval once for the whole run, then propagate the consent down to each scoped invocation. The authoritative caller registry is `references/workflow-callers.md`. Today's allowed callers are `/coverage-delta` (Phase 4 of `/test-modernize`) and `/quality-targets-converge` (Phase 5); both inherit the workflow-level approval obtained at `/test-modernize` Phase 0. Any new caller must document where its workflow-level approval is captured before adopting the flag; see the registry file for the full process.
Parse Arguments
The skill accepts free-form natural-language arguments AND the following named flags for workflow callers:
- `--scope <files-or-globs>`: restrict the mutation run to the listed files or shell globs. Comma-separated lists and quoted globs are both accepted. When omitted, the skill scopes to changed files by default.
- `--emit-json <path>`: write the structured result to `<path>` (see `## Machine-readable output` below) in addition to any human-readable output. The path's parent directory must be writable; if it is not, the skill reports the failure on stderr and exits non-zero.
- `--workflow-managed-approval`: skip the `Step 0` confirmation gate because the calling workflow captured approval at a higher boundary. Restricted to the allowlist in `## Constraints`.
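As a rough illustration of how the three named flags could be separated from the free-form text, here is a minimal sketch. The function names (`parseFlags`, `requireValue`, `splitScope`) and the shape of the returned object are hypothetical, not part of the skill; the sketch also splits `--scope` on every comma, so a brace glob such as `src/{a,b}.ts` would need extra handling.

```javascript
// Hypothetical parser for the named flags; everything else is kept as free-form text.
function parseFlags(argv) {
  const flags = { scope: null, emitJson: null, workflowManagedApproval: false, rest: [] };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--scope") {
      flags.scope = splitScope(requireValue(argv, ++i, arg));
    } else if (arg === "--emit-json") {
      flags.emitJson = requireValue(argv, ++i, arg);
    } else if (arg === "--workflow-managed-approval") {
      flags.workflowManagedApproval = true;
    } else {
      flags.rest.push(arg); // natural-language argument, left for the skill to interpret
    }
  }
  return flags; // scope === null means "default to changed files"
}

// Returns the value following a flag, or throws if it is missing.
function requireValue(argv, index, flag) {
  if (index >= argv.length || argv[index].startsWith("--")) {
    throw new Error(`${flag} requires a value`);
  }
  return argv[index];
}

// Splits a comma-separated scope list and drops empty entries.
function splitScope(value) {
  return value.split(",").map((s) => s.trim()).filter((s) => s.length > 0);
}

const parsed = parseFlags([
  "--scope", "src/a.ts, src/b/*.ts",
  "--emit-json", "out/result.json",
  "--workflow-managed-approval",
  "check", "billing",
]);
console.log(parsed);
```

Leaving `scope` as `null` when the flag is absent keeps the "changed files by default" decision in one place, downstream of parsing.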
Time estimation
Use the heuristics in references/tool-setup.md. Present the estimate to the user; if > 5 minutes, suggest scoping down.
Step 0: Confirmation gate
Before any mutation run, present the estimated time and the scope, then block on stdin for explicit approval. The prompt is observable: stdout contains the literal string `Estimated time:` followed by the scope summary. This gate is skipped when `--workflow-managed-approval` is set; see the carve-out in `## Constraints`.
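The gate logic can be sketched as three small pure functions. The names, the `Scope:` line, and the `[y/N]` wording are assumptions made for illustration; only the literal `Estimated time:` prefix and the skip condition come from the text above.

```javascript
// Builds the prompt text; the "Estimated time:" prefix is the observable literal.
// The "Scope:" label and "[y/N]" suffix are illustrative choices.
function confirmationPrompt(estimateSeconds, scope) {
  const minutes = Math.floor(estimateSeconds / 60);
  const seconds = estimateSeconds % 60;
  return `Estimated time: ${minutes}m ${seconds}s\nScope: ${scope.join(", ")}\nProceed? [y/N] `;
}

// The gate runs unless the calling workflow already captured approval.
function gateRequired(flags) {
  return !flags.workflowManagedApproval;
}

// Only an explicit yes counts; an empty answer is treated as a refusal.
function isApproval(answer) {
  return /^(y|yes)$/i.test(answer.trim());
}

console.log(confirmationPrompt(95, ["src/calculator.ts"]));
```

Treating anything other than an explicit yes as a refusal matches the "never surprise the user" constraint.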
Step 1: Detect or set up tooling
Detect and install the tool for the project's language (Stryker for JS/TS, pitest for Java/Kotlin, mutmut for Python, Stryker.NET for C#). Per-language detection and installation: references/tool-setup.md. Do not proceed without a working tool.
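The language-to-tool mapping above can be pictured as a lookup on file extension. This is only a sketch: the extension list and the `toolForFile` helper are assumptions, and the actual detection procedure is the one in `references/tool-setup.md`.

```javascript
// Illustrative mapping from file extension to the tool named in this step.
const TOOL_BY_EXTENSION = {
  ".js": "Stryker",
  ".ts": "Stryker",
  ".java": "pitest",
  ".kt": "pitest",
  ".py": "mutmut",
  ".cs": "Stryker.NET",
};

// Returns the tool name for a path, or null when the extension is not recognized.
function toolForFile(path) {
  const fileName = path.slice(path.lastIndexOf("/") + 1);
  const dot = fileName.lastIndexOf(".");
  if (dot < 0) return null;
  return TOOL_BY_EXTENSION[fileName.slice(dot)] ?? null;
}

console.log(toolForFile("src/calculator.ts"));
```

A `null` result corresponds to the "do not proceed without a working tool" rule: the skill would stop and help with setup rather than guess.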
Step 1b: Configure per-mutant timeout
Set a per-mutant wall-clock timeout before running. A timed-out mutant is killed
(counts toward the mutation score as a non-survivor). This matches the experiment-harness
fix in docs/experiments/.
Derive the timeout:
suite_time_seconds = time the baseline test suite (from Step 1 output, or measure: `time <test-command>`)
timeout_seconds = max(60, suite_time_seconds × 10)
Per-tool configuration:
| Tool | Config | Default shipped |
|---|---|---|
| Stryker (JS/TS) | `stryker.config.js`: `timeoutMS: <ms>`, `timeoutFactor: 2.5` | 60 000 ms |
| pitest (Java/Kotlin) | CLI: `--timeoutConst 60 --timeoutFactor 2.5` | 60 s const |
| mutmut (Python) | CLI: `--timeout <seconds>` (passed to subprocess) | 60 s |
| Stryker.NET (C#) | `stryker-config.yaml`: `timeout: 60000` | 60 000 ms |
Set the tool timeout to timeout_seconds (converting to ms for Stryker / Stryker.NET)
before running Step 2. Document the chosen timeout in the output summary.
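The derivation above works out as follows in code. The function names are hypothetical; the formula is the one stated in this step.

```javascript
// timeout_seconds = max(60, suite_time_seconds × 10)
function perMutantTimeoutSeconds(suiteTimeSeconds) {
  if (!Number.isFinite(suiteTimeSeconds) || suiteTimeSeconds < 0) {
    throw new RangeError("suite time must be a non-negative number of seconds");
  }
  return Math.max(60, suiteTimeSeconds * 10);
}

// Stryker and Stryker.NET take milliseconds, so convert before writing the config.
function toMilliseconds(seconds) {
  return Math.ceil(seconds * 1000);
}

console.log(perMutantTimeoutSeconds(4));    // 60: the floor applies to fast suites
console.log(perMutantTimeoutSeconds(12.5)); // 125
console.log(toMilliseconds(125));           // 125000
```

The 60-second floor matters for fast suites: a 4-second baseline would otherwise give a 40-second cap.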
Step 2: Run the tool (scoped to target)
Run scoped to user-specified files or changed files. Per-language commands: references/tool-setup.md. Capture full output and note HTML report paths.
Step 3: Parse results
Extract surviving mutants. Map each to:
| Field | Source |
|---|---|
| File + line | Tool report |
| Mutation operator | Tool report (ConditionalBoundary, NegateConditional, etc.) |
| Original code | Read the source at that line |
| Mutated code | Tool report or infer from operator |
| Mutation score | Tool summary |
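For Stryker, the mapping in the table could be done along these lines. The sketch assumes the JSON report layout of the mutation-testing report schema (`files`, then `mutants` with `mutatorName`, `replacement`, `location`, `status`) and 1-based line numbers; `extractSurvivors` and `sampleReport` are hypothetical names, and the sample data is invented for illustration. Other tools need their own readers.

```javascript
// Collects surviving mutants from a Stryker-style JSON report (assumed layout).
function extractSurvivors(report) {
  const survivors = [];
  for (const [file, entry] of Object.entries(report.files)) {
    const sourceLines = (entry.source || "").split("\n");
    for (const mutant of entry.mutants) {
      if (mutant.status !== "Survived") continue;
      const line = mutant.location.start.line; // assumed 1-based
      survivors.push({
        file,
        line,
        operator: mutant.mutatorName,
        original: (sourceLines[line - 1] || "").trim(), // whole source line, for context
        mutated: mutant.replacement ?? null,             // null when the tool omits it
      });
    }
  }
  return survivors;
}

// Invented sample data, shaped like the assumed report layout.
const sampleReport = {
  files: {
    "src/age.js": {
      source: "function isAdult(age) {\n  return age >= 18;\n}",
      mutants: [
        { id: "1", mutatorName: "EqualityOperator", replacement: "age > 18", status: "Survived",
          location: { start: { line: 2, column: 10 }, end: { line: 2, column: 19 } } },
        { id: "2", mutatorName: "BlockStatement", replacement: "{}", status: "Killed",
          location: { start: { line: 1, column: 23 }, end: { line: 3, column: 2 } } },
      ],
    },
  },
};

console.log(extractSurvivors(sampleReport));
```

Mutants reported with other statuses (for example, ones no test reaches) are ignored here; whether to triage them alongside survivors is a separate decision.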
Step 4: Triage survivors
For each survivor, classify and act:
| Classification | Meaning | Action |
|---|---|---|
| Equivalent | Mutation produces identical behavior | Mark excluded — no test can kill it |
| Missing assertion | Test executes the code but doesn't assert on affected output | Strengthen the assertion |
| Missing test case | No test exercises the mutated path | Write a new test |
| Undertested boundary | Mutation exposes a boundary/edge with no coverage | Add a boundary test |
| Acceptable risk | Trivial code where the mutation doesn't matter | Document and skip |
Triage procedure
- Read the source context — what does the code do and why.
- Check for equivalence — does the mutation actually change observable behavior? Common equivalent patterns: dead code or unreachable branches; commutative-operation reorderings; conditions redundant with other guards; logging/debug-only code.
- Find related tests — which tests cover this code; what do they assert.
- Classify as one of the five categories in the table above: equivalent, missing assertion, missing test case, undertested boundary, or acceptable risk.
- Write the fix test with RED-GREEN discipline: must fail against the mutant and pass against the original.
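The RED-GREEN requirement in the last step can be checked by hand on a small case. The sketch below is illustrative only: `isAdultOriginal`, `isAdultMutant`, and `killsMutant` are hypothetical names, the mutation is applied manually, and the tests are written as predicates so they can run against either implementation. It complements, and does not replace, re-running the mutation tool.

```javascript
// Original code and a hand-applied ConditionalBoundary mutant.
const isAdultOriginal = (age) => age >= 18;
const isAdultMutant = (age) => age > 18;

// Candidate fix test: checks both sides of the boundary.
function boundaryTest(isAdult) {
  return isAdult(18) === true && isAdult(17) === false;
}

// Weak test: values far from the boundary cannot tell the two versions apart.
function weakTest(isAdult) {
  return isAdult(30) === true && isAdult(5) === false;
}

// RED-GREEN: the test must pass on the original and fail on the mutant.
function killsMutant(test, original, mutant) {
  return test(original) && !test(mutant);
}

console.log(killsMutant(boundaryTest, isAdultOriginal, isAdultMutant)); // true
console.log(killsMutant(weakTest, isAdultOriginal, isAdultMutant));     // false
```

The weak test passes on both versions, which is exactly how the mutant survived in the first place.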
Weak vs strong test patterns
Most survivors come from tests that execute code without meaningfully asserting on behavior:
Arithmetic operators: beware identity values (0 for `+` and `-`, 1 for `*` and `/`, `""` for concat):
// WEAK: 0 is identity for addition — a + 0 === a - 0
expect(calculate(5, 0)).toBe(5); // passes with + or -
// STRONG: non-identity values distinguish operators
expect(calculate(5, 3)).toBe(8); // fails if + becomes -
Conditional boundaries — test both sides:
expect(isAdult(18)).toBe(true); // exactly at boundary
expect(isAdult(17)).toBe(false); // one below
Return values — assert on the actual return, not truthiness:
// WEAK: passes if return value changes from obj to true
expect(getUser(1)).toBeTruthy();
// STRONG: assert on shape
expect(getUser(1)).toEqual({ id: 1, name: "Alice" });
Statement deletion — verify side effects:
processOrder(order);
expect(db.save).toHaveBeenCalledWith(order); // catches removed save()
Step 5: Fix and verify
- Verify the fix test fails against the mutant — if possible, manually apply the mutation and run the test, or use the tool's re-run-specific-mutant feature.
- Re-run the mutation tool on the same scope to confirm the mutant is killed.
- Report the updated mutation score.
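When reporting the updated score, a before/after comparison is easy to compute from the tool's counts. This sketch assumes the score is killed mutants over total mutants, rounded to a whole percent; `mutationScore` is a hypothetical helper, and each tool's own summary remains the authoritative number.

```javascript
// Score as a whole percentage: killed / total.
function mutationScore(killed, total) {
  if (total <= 0) throw new RangeError("total must be positive");
  return Math.round((killed * 100) / total);
}

// Example: 41 of 50 killed before the fix tests, 43 of 50 after the re-run.
const before = mutationScore(41, 50);
const after = mutationScore(43, 50);
console.log(`Score: ${before}% -> ${after}% (+${after - before} points)`);
```

Because timed-out mutants count as killed under the per-mutant timeout rule, they belong in the `killed` figure.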
Output format
## Mutation Testing Results
**Tool:** Stryker 8.x | **Scope:** src/calculator.ts | **Duration:** 45s | **Per-mutant timeout:** 60s
**Score:** 82% (41 killed / 50 total, 3 equivalent, 6 survived)
### Surviving Mutants
| # | File:Line | Operator | Original | Mutated | Classification | Fix |
|---|---|---|---|---|---|---|
| 1 | calculator.ts:42 | ConditionalBoundary | `x > 0` | `x >= 0` | Undertested boundary | Add test: `expect(calc(0)).toBe(...)` |
| 2 | calculator.ts:67 | ReturnValue | `return result` | `return 0` | Missing assertion | Strengthen: assert on specific value |
### Equivalent Mutants (excluded)
| # | File:Line | Operator | Why equivalent |
|---|---|---|---|
| 1 | calculator.ts:15 | ArithmeticOperator | Dead code — branch unreachable |
### Recommended Test Additions
(Specific test code for each non-equivalent survivor)
Machine-readable output
When --emit-json <path> is set, write a structured result document to <path>. Workflow callers (/coverage-delta, /quality-targets-converge) read this document to compute deltas; downstream readers depend on the schema staying stable, so it is versioned.
Success envelope (schema_version: 1):
{
"schema_version": 1,
"tool": "stryker",
"scope": ["src/calculator.ts"],
"captured_at": "2026-06-19T14:22:08Z",
"total": 50,
"killed": 41,
"survived": 6,
"equivalent": 3,
"survivors": [
{ "file": "src/calculator.ts", "line": 42, "operator": "ConditionalBoundary", "status": "survived" },
{ "file": "src/calculator.ts", "line": 67, "operator": "ReturnValue", "status": "equivalent" }
]
}
Each entry in `survivors` carries `file`, `line`, `operator`, and `status`, where `status` is `"survived"` or `"equivalent"`. Callers MUST filter out entries with `status: "equivalent"` before computing deltas so reclassifications between runs don't show up as regressions.
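A caller-side delta computation might look like the sketch below. The helper names and the `file:line:operator` key are assumptions; line numbers shift when code is edited, so a real caller may want a more stable identity. The sample envelopes are invented and trimmed to the fields used.

```javascript
// Drop equivalent mutants before any comparison, as the schema requires.
function liveSurvivors(result) {
  return result.survivors.filter((s) => s.status !== "equivalent");
}

// Illustrative identity for a survivor across runs.
function survivorKey(s) {
  return `${s.file}:${s.line}:${s.operator}`;
}

// Compares two success envelopes and reports what changed.
function survivorDelta(previous, current) {
  const before = new Set(liveSurvivors(previous).map(survivorKey));
  const after = new Set(liveSurvivors(current).map(survivorKey));
  return {
    newlySurviving: [...after].filter((k) => !before.has(k)),
    noLongerSurviving: [...before].filter((k) => !after.has(k)),
  };
}

const previousRun = { schema_version: 1, survivors: [
  { file: "src/calculator.ts", line: 42, operator: "ConditionalBoundary", status: "survived" },
  { file: "src/calculator.ts", line: 15, operator: "ArithmeticOperator", status: "equivalent" },
] };
const currentRun = { schema_version: 1, survivors: [
  { file: "src/calculator.ts", line: 42, operator: "ConditionalBoundary", status: "survived" },
  { file: "src/calculator.ts", line: 15, operator: "ArithmeticOperator", status: "equivalent" },
  { file: "src/calculator.ts", line: 80, operator: "ReturnValue", status: "survived" },
] };

console.log(survivorDelta(previousRun, currentRun));
```

The equivalent mutant at line 15 never enters either set, so it cannot appear as a regression or an improvement.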
Error envelopes (exit code non-zero, <path> still written for caller diagnostics):
{ "schema_version": 1, "tool": null, "error": "no_tool_installed", "language": "javascript" }
{ "schema_version": 1, "tool": "stryker", "error": "empty_scope", "scope_glob": "src/does-not-exist/*.ts" }
When <path> itself is unwritable (read-only directory, permission denied), the skill writes nothing to disk, prints the offending path to stderr, and exits non-zero. No partial JSON is left behind.
Per-tool worked examples (Stryker, pitest, mutmut, Stryker.NET) live in references/tool-setup.md under ## Machine-readable output schema.
When not to apply
- No tests exist yet → write tests first.
- No tool installed and user declines setup → explain the limitation; do not estimate.
- Prototype or spike code.