mutation-testing

star 19

Validates Go test suite quality through mutation testing using go-gremlins/gremlins. Mutates production code, runs the test suite against each mutant, and reports which mutants the tests fail to kill — exposing weak assertions that line coverage cannot detect. Use when evaluating test effectiveness, validating newly written tests, or improving test quality for mission-critical code (consensus, channel state, payment flows, crypto). Triggers: "mutation test", "are these tests strong", "validate test quality", "/mutation-testing".

Roasbeef By Roasbeef schedule Updated 5/20/2026

name: mutation-testing description: 'Validates Go test suite quality through mutation testing using go-gremlins/gremlins. Mutates production code, runs the test suite against each mutant, and reports which mutants the tests fail to kill — exposing weak assertions that line coverage cannot detect. Use when evaluating test effectiveness, validating newly written tests, or improving test quality for mission-critical code (consensus, channel state, payment flows, crypto). Triggers: "mutation test", "are these tests strong", "validate test quality", "/mutation-testing".'

Mutation Testing

Mutation testing evaluates test quality by introducing small, deliberate bugs into production code (mutants) and checking whether the test suite fails. A test that passes on a mutant did not actually verify the behavior the mutant changed.

This skill is a thin orchestrator over go-gremlins/gremlins — a maintained Go mutation testing tool. The skill provides install, run, and analysis wrappers that produce machine-readable JSON for downstream tooling (notably the test-refine skill).

Why Mutation Testing

A test suite can hit 100% line coverage and still be useless: tests can execute code without asserting on its results, or assert only on side-irrelevant fields. Mutation testing closes this gap by checking whether the test suite distinguishes the original code from a mutant. See references/coverage-pitfalls.md (in the test-refine skill) for the broader context.

When to Use

  • After generating tests with test-forge or by hand — verify they have real assertions.
  • Before merging consensus / payment / crypto code — quality gate on critical paths.
  • During code review — surface weak tests in the diff.
  • As a signal source for test-refine — survivors map to weak-assertion findings.

Target efficacy (gremlins terminology: test_efficacy = killed / (killed + lived)):

Code class Target
Mission-critical (consensus, wallet, channel, crypto) 90%+
Core business logic 80–90%
General code 70–80%
Trivial/glue code run only if cheap

Workflow

1. Install gremlins (once)

~/.claude/skills/mutation-testing/scripts/install-gremlins.sh

The script pins to a known-good version (override with GREMLINS_VERSION=...). Requires go on PATH and $(go env GOPATH)/bin on PATH.

2. Run mutations

# Default: cwd, JSON to .reviews/mutations/<slug>.json
~/.claude/skills/mutation-testing/scripts/unleash.sh

# Targeted package
~/.claude/skills/mutation-testing/scripts/unleash.sh \
    --pkg ./internal/wallet \
    --output .reviews/mutations/wallet.json

# With integration tests and a config file
~/.claude/skills/mutation-testing/scripts/unleash.sh \
    --pkg ./internal/channel \
    --integration \
    --config .gremlins.yaml \
    --silent

3. Analyze survivors

~/.claude/skills/mutation-testing/scripts/analyze-survivors.sh \
    --input .reviews/mutations/wallet.json \
    --output .reviews/mutations/wallet.md

Produces a markdown report with: efficacy/coverage summary, survivors ranked by file (consensus/channel/wallet paths bubble to the top), and mutator-type breakdown.

Gremlins JSON Schema

gremlins unleash --output <file> emits a single JSON document:

{
  "go_module": "github.com/example/foo",
  "test_efficacy": 82.00,
  "mutations_coverage": 80.00,
  "mutants_total": 100,
  "mutants_killed": 82,
  "mutants_lived": 8,
  "mutants_not_viable": 2,
  "mutants_not_covered": 10,
  "elapsed_time": 123.456,
  "files": [
    {
      "file_name": "wallet.go",
      "mutations": [
        { "line": 42, "column": 8, "type": "CONDITIONALS_NEGATION", "status": "KILLED" }
      ]
    }
  ]
}

Mutation status values:

Status Meaning Action
KILLED Test suite caught the mutation Good — no action
LIVED Tests passed despite mutation Survivor — strengthen tests
NOT COVERED Mutation in code no test exercises Add a test for that path
TIMED OUT Tests timed out — implicit kill Investigate (might be perf bug)
NOT VIABLE Mutation produced uncompilable code Excluded from score
RUNNABLE Dry-run only; would be tested (only in --dry-run)

Key metrics:

  • test_efficacy = killed / (killed + lived) — quality of assertions on covered code.
  • mutations_coverage = (killed + lived) / (killed + lived + not_covered) — how much code is exercised at all.

A high mutations_coverage with low test_efficacy means tests run code without verifying its behavior — the classic "100% line coverage, 0% real testing" failure mode.

Configuration

Gremlins is configured via .gremlins.yaml (or --config <path>). Mutators ship default-on for safe operators and default-off for aggressive ones.

Default-on mutators (always enabled):

  • arithmetic-base+ - * / %
  • conditionals-boundary< <= > >=
  • conditionals-negation== !=, boolean conditions
  • increment-decrement++ --
  • invert-negatives-x+x

Default-off mutators — enable for critical packages:

  • invert-assignments+= -= *= /= etc. swaps
  • invert-bitwise& | ^ swaps
  • invert-bwassign&= |= ^= swaps
  • invert-logical&& ↔ || (security-critical: catches auth bypass mutations)
  • invert-loopctrlbreak ↔ continue
  • remove-self-assignments — drop x = x op y updates

Recommended config for consensus/wallet/payment code:

silent: false
unleash:
  workers: 0          # use all CPUs
  test-cpu: 0         # no per-test CPU pinning
  threshold:
    efficacy: 90      # fail if below 90%
    mutant-coverage: 85
mutants:
  arithmetic-base:        { enabled: true }
  conditionals-boundary:  { enabled: true }
  conditionals-negation:  { enabled: true }
  increment-decrement:    { enabled: true }
  invert-negatives:       { enabled: true }
  invert-assignments:     { enabled: true }
  invert-bitwise:         { enabled: true }
  invert-bwassign:        { enabled: true }
  invert-logical:         { enabled: true }   # critical for && / || in auth
  invert-loopctrl:        { enabled: true }
  remove-self-assignments:{ enabled: true }

See gremlins.dev configuration docs for the full schema.

Threshold Gating (CI)

For CI, use --silent and set thresholds in config or via env vars:

gremlins unleash --silent --output mutations.json ./...
# Exit nonzero if efficacy < threshold.

The unleash.threshold.efficacy and unleash.threshold.mutant-coverage keys cause gremlins to exit nonzero when the run falls below the configured percentages — wire this into your PR check.

Integration with Other Skills

test-refine

The test-refine skill consumes gremlins JSON to identify weak-assertion zones (smell S12: mutation-survivor). When invoked with --use-mutations, it calls unleash.sh and cross-references LIVED mutants with the AST smell scan.

test-forge

After test-forge generates tests, run mutation testing to validate them. LIVED mutants are direct evidence of weak assertions in the generated tests.

code-review

Include the test_efficacy delta in PR review — regression of >5% in covered code is a strong signal of weakening test quality.

Interpreting Results

High efficacy (≥90%): Tests have strong assertions. Focus remaining work on NOT COVERED mutants (uncovered code paths).

Medium (75–90%): Tests cover main paths. Survivors usually indicate boundary or error-path gaps.

Low (<75%): Significant gaps — tests likely run code without checking outputs. Pair with test-refine to identify the specific smells.

Mutator breakdown tells you the kind of weakness:

  • conditionals-boundary LIVED → missing edge tests at thresholds.
  • invert-logical LIVED → missing truth-table coverage for &&/||.
  • arithmetic-base LIVED → tests don't verify calculation results.
  • remove-self-assignments LIVED → state mutations not asserted.

Equivalent Mutants

Some LIVED mutants are semantically equivalent to the original — no test could kill them. Common cases:

  • Mutated value immediately overwritten before being read.
  • Mutation in unreachable code.
  • Operator swap in associative/commutative context with no observable difference.

When you identify an equivalent mutant, document it (e.g., a comment near the mutation site, or a project-level EQUIVALENT_MUTANTS.md) so reviewers don't waste time on it. Gremlins doesn't filter equivalents automatically.

Gremlins Limitations

From the upstream README: gremlins targets smallish Go modules (microservices). On very large modules, runs can take hours. Mitigations:

  • Per-package runs via --pkg ./internal/wallet. Don't pass ./... on a 500k-LOC monorepo.
  • Skip generated code by using build tags or running on hand-written packages only.
  • Use --workers to bound parallelism if memory is tight.
  • Use --dry-run first to preview the mutation count and skip if it's too large.

Further Reading

Install via CLI
npx skills add https://github.com/Roasbeef/claude-files --skill mutation-testing
Repository Details
star Stars 19
call_split Forks 3
navigation Branch main
article Path SKILL.md
More from Creator