name: nw-test-optimization description: Methodology for minimizing test count while maximizing behavioral coverage - behavior definition, anti-pattern catalog, consolidation patterns, stopping criterion, coverage-preserving validation user-invocable: false disable-model-invocation: true
Test Optimization Methodology
Mission
Minimize tests, maximize value, reduce feedback time, maintain quality. (Ale, 2026-04-28: "Bisogna minimizzare i test, massimizzare il valore per ridurre il tempo di feedback, mantenendo la qualità.")
This skill operationalizes that mission. Apply during DELIVER COMMIT, scheduled audits, or /nw-optimize-tests invocations.
1. Behavior Definition (the loose joint, closed)
The phrase "distinct behavior" in the test budget formula max_unit_tests = 2 × distinct_behaviors is the loose joint that lets test counts inflate. Close it with these rules.
1.1 What IS a behavior
A behavior is an observable outcome via a port (driving or driven):
- A return value from a driving-port call given specific inputs
- A state change visible through a driving-port query after an action
- A side effect at a driven-port boundary (call sequence + payload)
- An exception raised from a driving port
- A business invariant that holds across a class of inputs
1.2 What is NOT a behavior
- Phrase presence in markdown — that is one document contract, not one behavior per phrase
- AST shape (try/except wrapping, decorator presence, import order) — that is source structure, not runtime behavior
- Type system facts (subclass relations, attribute presence) — Python language guarantees, not domain behaviors
- Internal data shape (dataclass field assignment, dict structure) — language guarantees
- Internal method calls (mock.assert_called_with) — implementation, not outcome
- Source-shape compliance — runtime behavior on each supported environment IS the behavior, not the source it was compiled from
1.3 Counting rules — concrete cases
| Surface | Behaviors |
|---|---|
| Markdown skill with required phrases | 1 (the file conforms to the contract) |
| 5 skill files × 30 required phrases each | 5 (one per file), NOT 150 |
| Function with N input variations, same assertion shape | 1 (parametrize the variations) |
| Function with N error types, distinct messages and paths | N |
| Adapter that calls a driven port with ordered payload | 1 per call site, asserted once |
| Pure dataclass storing fields | 0 (Python guarantees this) |
| ABC with 5 abstract methods | 0 (Python guarantees abstract enforcement) |
1.4 Re-derivation procedure
Before counting tests in a target scope:
- List the driving ports the scope exposes
- For each driving port, list the observable outcomes it produces (return values, state, side effects, exceptions)
- Collapse outcomes that vary only by input value into one parametrized behavior
- Count the result. That number × 2 = budget.
If your count exceeds 2× behaviors, you have either testing theater or genuinely high behavioral surface — investigate which before adding mass.
2. Banned Anti-Patterns
Each pattern below is an automatic block at review. Counter-example shows the right test.
2.1 Language-Guarantee Tests
Tests that assert what the language already guarantees.
# BANNED — Python @abstractmethod already enforces this at instantiation
def test_config_port_interface_defines_required_methods():
assert issubclass(ConfigPort, ABC)
assert hasattr(ConfigPort, "get_timeout_threshold_default")
# CORRECT — test runtime behavior of a concrete adapter
def test_config_adapter_returns_default_when_unset():
adapter = EnvironmentConfigAdapter(env={})
assert adapter.get_timeout_threshold_default() == DEFAULT_THRESHOLD
2.2 AST-Shape Tests
Tests that parse source and assert structural shape.
# BANNED — tests source structure, not runtime behavior
def test_no_bare_typing_self_import():
src = Path("src/des/domain/value_objects.py").read_text()
tree = ast.parse(src)
# ... assert try/except wraps the import ...
# CORRECT — test runtime behavior on each supported version (matrix in CI)
def test_value_objects_import_succeeds_on_python_310():
# Run pytest under Python 3.10 in CI matrix
from des.domain.value_objects import OrderId
assert OrderId("abc").value == "abc"
Source compliance is a CI matrix concern, not a unit test.
2.3 Trivial Dataclass-Storage Tests
# BANNED — Python @dataclass guarantees field assignment
def test_turn_limit_config_stores_limits_by_task_type():
config = TurnLimitConfig(quick=20, deep=60)
assert config.quick == 20
assert config.deep == 60
# CORRECT — test the behavior that uses the config
def test_turn_counter_aborts_quick_task_at_limit():
counter = TurnCounter(TurnLimitConfig(quick=20, deep=60))
for _ in range(20):
counter.increment("quick")
assert counter.is_exhausted("quick")
If the dataclass has invariants (validation, derived fields), those ARE behaviors — test them.
2.4 Mock-Asserting-Mock
# BANNED — mock returns what you told it to; you are testing unittest.mock
def test_repository_returns_user():
mock_repo = Mock()
mock_repo.get.return_value = User(name="Alice")
result = mock_repo.get(1)
assert result.name == "Alice"
# CORRECT — test the application service that uses the repository
def test_user_service_returns_active_user():
repo = InMemoryUserRepository(users=[User(id=1, name="Alice", active=True)])
service = UserService(repo)
assert service.get_active(1).name == "Alice"
2.5 Parametrize-Inflation
One contract becomes N tests by parametrizing every variant.
# BANNED — 150 tests for "the markdown contains required phrases"
@pytest.mark.parametrize("phrase", PHRASES_30)
@pytest.mark.parametrize("skill_dir", SKILLS_5)
def test_skill_contains_phrase(skill_dir, phrase):
md = (skill_dir / "SKILL.md").read_text()
assert phrase in md # 30 × 5 = 150 test cases
# CORRECT — 1 test per file, asserts the contract once
@pytest.mark.parametrize("skill_dir", SKILLS_5)
def test_skill_contains_required_phrases(skill_dir):
md = (skill_dir / "SKILL.md").read_text()
missing = [p for p in REQUIRED_PHRASES if p not in md]
assert missing == [], f"{skill_dir.name} missing: {missing}"
Failure granularity is preserved: the assertion message names the missing phrases. Test count drops 30×.
2.6 Migration Regression Nets That Never Collapse
A migration produces a regression net (e.g. "every old skill name now exists at new location"). After 1 stable release with the migration green, the net MUST collapse to a single iteration.
# BANNED after migration is stable — 315 tests asserting filesystem invariants
@pytest.mark.parametrize("skill_name", SKILL_NAMES_149)
def test_skill_directory_exists(skill_name):
assert (SKILLS_DIR / f"nw-{skill_name}").is_dir()
# CORRECT post-stabilization — 1 test, single iteration
def test_all_canonical_skills_present():
expected = set(load_canonical_skill_names())
actual = {p.name.removeprefix("nw-") for p in SKILLS_DIR.glob("nw-*")}
missing = expected - actual
assert not missing, f"Missing skills: {sorted(missing)}"
3. Consolidation Patterns
Apply in this order. Each preserves coverage.
3.1 Parametrize Collapse
When N tests differ only by input value with the same assertion shape, collapse to one parametrized test. Failure granularity preserved by parameter ID.
3.2 Dict Iteration Collapse
When N parametrized tests assert independent membership/equality, collapse to one test iterating a dict and reporting all violations at once.
# BEFORE — 12 tests
@pytest.mark.parametrize("event,handler", [("RED", h1), ("GREEN", h2), ...])
def test_event_routes_to_handler(event, handler):
assert ROUTING[event] is handler
# AFTER — 1 test, all violations reported
def test_event_routing_table_complete_and_correct():
expected = {"RED": h1, "GREEN": h2, "COMMIT": h3, ...}
assert ROUTING == expected
3.3 Fixture Scope Promotion
Read-only fixtures used by N tests can promote to module or session scope when independence is preserved (no shared mutable state). Speeds up wall time without changing behavior coverage.
@pytest.fixture(scope="module") # was "function"
def loaded_skill_index():
return SkillIndex.load_from(SKILLS_DIR)
Audit: tests using the fixture must not mutate it. If any test mutates, scope cannot promote.
3.4 xdist_group Tagging
When same-file tests benefit from a shared expensive fixture, add @pytest.mark.xdist_group("name") so the scheduler keeps them on the same worker. Fixture setup runs once per worker instead of once per test.
@pytest.mark.xdist_group("update_check_http_server")
class TestUpdateCheckService:
# All methods share the HTTP server fixture, scheduled to one worker
...
3.5 Migration-Collapse Lifecycle
Regression nets from one-time migrations (rename, move, restructure) MUST collapse within 1 stable release after migration completion. Definition of "stable release":
- 1 release with the migration code green in CI for at least 7 days
- No follow-up bug reports referencing the migration during that window
After stabilization:
- Replace per-item parametrized tests with 1 single-iteration test reporting all violations at once
- Keep failure messages informative (set difference, dict diff)
- Document the collapse in commit body:
refactor(tests): collapse {migration} regression net (315 → 3) — stable since {date}
3.6 Cross-Tier Deduplication
If tests/<file>.py and tests/<subdir>/<file>.py are byte-identical (md5-equal), delete the less canonical one. Canonical = the tier-correct location (unit under unit/, integration under integration/).
If two files are not byte-identical but assert the same handler/service through overlapping intent, merge into the canonical tier and delete the other.
3.7 Single-Lifecycle Consolidation
When N tests share an expensive setup/teardown lifecycle (subprocess install, container start, filesystem fixture build) AND each test asserts a distinct contract on the same post-setup state, collapse to one lifecycle, N assertions instead of N lifecycles × 1 assertion.
# BEFORE — 24 tests × ~6s lifecycle each = 152s wall-clock
class TestTutorialSetupScripts:
def setup_method(self):
self.workspace = build_tutorial_workspace() # expensive
run_setup_script(self.workspace)
def test_creates_project_dir(self): assert (self.workspace / "project").is_dir()
def test_creates_config_file(self): assert (self.workspace / ".nwave/config.json").exists()
# ... 22 more independent assertions ...
# AFTER — 1 lifecycle, 24 assertions = 63s wall-clock (2.4× faster)
@pytest.fixture(scope="class")
def tutorial_workspace():
workspace = build_tutorial_workspace()
run_setup_script(workspace)
return workspace
class TestTutorialSetupScripts:
def test_creates_project_dir(self, tutorial_workspace):
assert (tutorial_workspace / "project").is_dir()
def test_creates_config_file(self, tutorial_workspace):
assert (tutorial_workspace / ".nwave/config.json").exists()
# ... 22 more, all reading the same workspace ...
Empirical anchor: tests/build/acceptance/test_tutorial_setup_scripts.py (commit defc07f0d, 2026-05-18): 152.81s → 62.87s, 2.4× faster, -90s.
Pre-conditions (HARD GATES):
- Setup is read-only for the assertions — no test mutates shared state. If even one test mutates, scope cannot promote (audit per §3.3).
- Failure granularity preserved: each assertion identifies WHAT failed (file/property/contract), not just "setup failed".
- Tests remain order-independent —
pytest-randomlymust not change outcomes (proves no hidden coupling).
Anti-pattern: do NOT collapse when assertions verify steps of a state-transition sequence (run A → assert, mutate B → assert, mutate C → assert). That is state-delta paradigm territory (§3.8), not single-lifecycle.
3.8 State-Delta Paradigm (cross-ref)
For tests that mutate user-observable state (installer, uninstaller, sync, hooks, settings.json — ~28% of suite), use the state-delta paradigm instead of per-assertion lifecycle: capture initial state once, apply operation, assert the delta (added/removed/modified) as a single matcher.
Honest gain: 13% compression / 17% wall-clock on the addressable subset (Ale 2026-05-05 revision). NOT universal — pure-function/AST/schema tests retain standard assertions (3-5× ceremony for zero gain otherwise).
See: nw-state-delta-paradigm skill (when present) and memory feedback_state_transition_test_paradigm for scope rules. Empirical anchor: Task #12 pilot (both slices).
4. Stopping Criterion Procedure
Apply when planning unit-test authoring inside RED (3-phase canon, ADR-025) or RED_UNIT (legacy 5-phase), at GREEN, and at COMMIT. Reviewer enforces at review.
- Count behaviors — list distinct port-level behaviors in the AC. Use the rules in Section 1.
- Compute budget —
budget = 2 × behavior_count. Document in commit body:Test budget: N behaviors × 2 = M unit tests. - Compare to actual — count parametrized tests as 1, count parametrize cases as the parameter id contributes. Group dict-iteration assertions as 1.
- Below budget — proceed.
- Above budget — trigger consolidation (Section 3). If consolidation reduces below budget, proceed.
- Still above after consolidation — two paths:
- Justify in commit body: cite the specific behaviors that genuinely require N tests and the consolidation patterns already applied
- Or redesign the abstraction: high test count often signals the unit under test owns too many responsibilities
- Reviewer enforcement — counts independently. Budget exceeded without justification or applied pattern citation = block.
4-bis. Paradigm-Match Decision Rule
Before authoring or migrating tests, match the test SHAPE to the right paradigm. Mismatched paradigm = ceremony without value (or correctness loss).
| Test shape | Paradigm | Empirical anchor |
|---|---|---|
| Closed-world finite input (N known files, M known event types, K known skill names) — assertion shape identical across instances | Parametrize-collapse → §3.1 / Dict-iteration → §3.2 | c2637f6c8 set-difference 155-test → 1 (8.9× faster) |
| Multi-step contract on shared expensive setup — independent assertions on post-setup read-only state | Single-lifecycle consolidation → §3.7 | defc07f0d 24-test 152s → 63s (2.4× faster) |
| User-observable state mutation (installer/uninstaller/sync/hooks/settings) — N tests verifying same lifecycle's side effects | State-delta paradigm → §3.8 | Task #12 pilot (13% compression / 17% wall-clock) |
| Unbounded input domain with universal invariant (algorithm, serialization, business rule) — "for all X in DOMAIN, P(X) holds" | Property-based testing (Hypothesis) — see nw-property-based-testing |
Standard PBT literature; nWave-internal scope = unbounded ONLY |
| Single happy-path + 1-3 sad paths with distinct error messages | Example-based unit tests, no consolidation needed | n/a — already minimal |
Falsifier-gate before adopting PBT
Closed-world finite input is NOT PBT territory. Hypothesis import (~457ms) + per-example bookkeeping is slower than @pytest.mark.parametrize when the input set is finite + enumerable. Apply the gate:
- Enumerate the input domain. Is it finite + listable (
SKILL_NAMES_149,EVENT_TYPES_5,SUPPORTED_PYTHONS_3)? → parametrize-collapse, NOT PBT. - Is the invariant value-independent (holds for ANY valid X, not specific Xs)? → PBT candidate.
- Run cost-benefit: if domain ≤ 10× the typical PBT example budget (100), parametrize wins on wall-clock + readability + shrinking-from-trivial-counterexamples cost.
Empirical anchor 2026-05-18: PBT migration attempt on 155-file closed-world skill registry was correctly aborted at recon stage by the falsifier-gate. Solution was set-difference parametrize-collapse (c2637f6c8, 5.42s → 0.71s, 8.9× faster). Documented in memory feedback_state_transition_test_paradigm (revised 2026-05-05).
Decision tree (concise)
Test shape?
├─ Same assertion, varying inputs from finite known set?
│ └─ parametrize-collapse (§3.1) OR dict-iteration (§3.2)
├─ N independent assertions on same post-setup read-only state?
│ └─ single-lifecycle consolidation (§3.7)
├─ State-mutation lifecycle with delta assertions?
│ └─ state-delta paradigm (§3.8)
├─ Universal invariant over unbounded domain?
│ └─ PBT (nw-property-based-testing)
└─ Few specific examples with distinct outcomes?
└─ example-based, no consolidation
5. Coverage-Preserving Validation
Before declaring an optimization done, prove no behavior was lost.
5.1 Baseline before optimization
uv run pytest <scope> -p no:randomly --tb=no -q | tail -3
# Record: passed count, failed count
uv run pytest <scope> --cov=<package> --cov-report=term-missing -p no:randomly | tail -20
# Record: coverage %, missing lines
5.2 Apply optimization
Apply consolidation patterns. Stage changes file-by-file (git add path/to/file).
5.3 Validate after optimization
uv run pytest <scope> -p no:randomly --tb=short
# Required: passed count >= baseline (consolidation reduces test count, not pass count semantics)
uv run pytest <scope> --cov=<package> -p no:randomly | tail -5
# Required: coverage % >= baseline
Acceptable outcomes:
- Test count drops, pass count drops by the consolidation delta, coverage % unchanged or higher
- Test count drops, pass count drops, coverage % drops by less than 0.5% AND the missing lines are demonstrably language-guarantee tests (Section 2.1) — document in commit body
Block conditions:
- Coverage drops by >= 0.5% without explicit justification
- Any test that was green pre-optimization is now red
- Any production file appears in git diff (optimizer never modifies production)
5.4 Optional mutation validation
For high-confidence optimizations on critical scopes:
uv run mutmut run --paths-to-mutate <scope>
Kill rate before optimization vs after must not regress. Loaded only when invoking nw-mutation-test skill.
6. Scope Selection Heuristics
When invoked without a specific scope, prioritize by leverage:
| Indicator | Priority | Pattern |
|---|---|---|
| Byte-identical file pairs (md5-equal) | P0 | Cross-Tier Deduplication (§3.6) |
| Single file with > 200 collected tests | P0 | Investigate parametrize-inflation, migration nets (§2.5, §3.5) |
Test class with setup_method building expensive workspace, > 10 read-only assertions |
P0 | Single-Lifecycle Consolidation (§3.7) |
| Closed-world finite domain tests (N known files × M known phrases) | P0 | Parametrize-collapse (§3.1), NOT PBT — see §4-bis falsifier-gate |
| Test file dominating slow-suite survey (top-N wall-clock) | P0 | Apply Paradigm-Match Decision Rule (§4-bis) before authoring fixes |
| Tests/function ratio > 4 in a module | P1 | Behavior re-counting, anti-pattern scan |
Files matching *_typing_compat, *_interface_*, *_abc_* |
P1 | Language-guarantee scan |
| AST-import test files | P1 | Replace with CI matrix |
| Files older than 6 months touching migration paths | P2 | Migration-collapse lifecycle check (§3.5) |
Use git log --diff-filter=A --name-only for migration-net dating, find tests/ -name '*.py' -exec wc -l {} + | sort -rn for fat files.
7. What This Methodology Does NOT Cover
- Test infrastructure (fixtures, conftest, plugins) — that is platform-architect or troubleshooter scope
- Production code refactoring —
/nw-refactorand the crafter scope - New test authoring — crafter scope (DELIVER wave)
- Adapter integration tests — different rules apply (real I/O, no parametrize-collapse there); see hexagonal-testing skill
8. Cross-References
nw-tdd-methodology— Mandate 1 (Observable Behavioral Outcomes), Mandate 5 (Parametrize Input Variations)nw-tdd-review-enforcement— reviewer block conditionsnw-mutation-test— coverage-preserving validation via mutation kill ratenw-property-based-testing— PBT paradigm, falsifier-gate for closed-world domainsnw-test-design-mandates— universe-per-layer, state-delta + Universe matrix (§263-270)nw-test-refactoring-catalog— refactoring patterns for test code structuredocs/analysis/investigation-overtesting-hypothesis-2026-04-28.md— empirical evidence (~580 removable tests, 18% of unit suite, the gap is enforcement decay + loose behavior definition)- Empirical speedup commits 2026-05-18:
c2637f6c8(parametrize-collapse 8.9×),defc07f0d(single-lifecycle 2.4×),e97c94663+a90606d6b(CVE+timeout+tiktoken)