name: fixing-tests description: "Use when tests themselves are broken, test quality is poor, or user wants to fix/improve tests. Triggers: 'test is broken', 'test is wrong', 'test is flaky', 'make tests pass', 'tests need updating', 'green mirage', 'tests pass but shouldn't', 'audit report findings', 'run and fix tests'. NOT for: bugs in production code caught by correct tests (use debugging)." intro: | Diagnoses and fixes broken tests by classifying failure types (test bug vs code bug vs environment issue) and applying targeted repairs. Operates in three modes: fixing specific tests, processing green-mirage audit findings, and run-then-fix for batch repair. This core spellbook skill focuses on test reliability, not production code bugs.
Fixing Tests
Invariant Principles
- Tests catch bugs, not checkmarks. Every fix must detect real failures, not just achieve green status.
- Production bugs are not test issues. Flag and escalate; never silently fix broken behavior.
- Read before fixing. Read the test file AND the production file under test. Never guess.
- Verify proves value. Unverified fixes are unfinished fixes.
- Scope discipline. Fix tests, not features.
Inputs
| Input | Required | Description |
|---|---|---|
test_output |
No | Test failure output to analyze (for run_and_fix mode) |
audit_report |
No | Green mirage audit findings with patterns and YAML block |
target_tests |
No | Specific test files or functions to fix (for general_instructions mode) |
test_command |
No | Command to run tests; defaults to project standard |
Input Modes
Detect mode from user input, build work items accordingly.
| Mode | Detection | Action |
|---|---|---|
audit_report |
Structured findings with patterns 1-10, "GREEN MIRAGE" verdicts, YAML block | Parse YAML, extract findings. Read patterns/assertion-quality-standard.md for assertion quality gate and Full Assertion Principle. |
general_instructions |
"Fix tests in X", "test_foo is broken", specific test references | Extract target tests/files |
run_and_fix |
"Run tests and fix failures", "get suite green" | Run tests, parse failures |
If unclear: ask user to clarify target.
WorkItem Schema
interface WorkItem {
id: string; // "finding-1", "failure-1", etc.
priority: "critical" | "important" | "minor" | "unknown";
test_file: string;
test_function?: string;
line_number?: number;
pattern?: number; // 1-10 from green mirage
pattern_name?: string;
current_code?: string; // Problematic test code
blind_spot?: string; // What broken code would pass
suggested_fix?: string; // From audit report
production_file?: string; // Related production code
error_type?: "assertion" | "exception" | "timeout" | "skip";
error_message?: string;
expected?: string;
actual?: string;
}
Phase 0: Input Processing
Dispatch subagent (Task tool) with /fix-tests-parse command. Subagent parses input (audit YAML, fallback headers, or general instructions) into WorkItems and determines commit strategy.
Phase 1: Discovery (run_and_fix only)
Skip for audit_report/general_instructions modes.
pytest --tb=short 2>&1 || npm test 2>&1 || cargo test 2>&1
Parse failures into WorkItems with error_type, message, stack trace, expected/actual. If suite passes completely: report "no failures found" and confirm with user before exiting.
Phase 2: Fix Execution
Dispatch subagent (Task tool). Subagent MUST read the referenced files:
First, read these files to understand the quality requirements:
- Read the fix-tests-execute command file for the fix execution protocol
- Read patterns/assertion-quality-standard.md for the complete Assertion Strength Ladder and Full Assertion Principle
Then execute the fix protocol on these work items: [work items]
[Copy in the full Test Writer Template from skills/dispatching-parallel-agents/SKILL.md before dispatching]
THE FULL ASSERTION PRINCIPLE (most important rule):
ALL assertions must assert exact equality against the COMPLETE expected output.
This applies to ALL output -- static, dynamic, or partially dynamic.
assert result == expected_complete_output -- CORRECT
assert result == f"Today is {datetime.date.today()}" -- CORRECT (dynamic: construct full expected)
assert "substring" in result -- BANNED. ALWAYS. NO EXCEPTIONS.
assert dynamic_value in result -- BANNED. Dynamic content is no excuse for partial check.
When fixing partial assertions on dynamic output: construct the complete expected value
using the same logic as the function, then assert ==. Prefer construct-then-compare
over normalization. Normalization is last resort only for truly unknowable values
(random UUIDs, OS-assigned PIDs, memory addresses).
When fixing partial mock assertions: also check whether ALL mock calls are fully asserted.
Assert EVERY call with ALL args; verify call count. NEVER use mock.ANY -- construct
the expected argument dynamically if it is dynamic.
BANNED PATTERNS (if your fix introduces ANY of these, it is NOT a fix):
- assert "X" in result (bare substring on any output -- static or dynamic)
- assert len(result) > 0 (existence only)
- assert result is not None without value assertion
- assert "X" in result and "Y" in result (multiple partials are still partial)
- assert result == function_under_test(same_input) (tautological)
- mock.ANY in any call assertion
- assert_called() or assert_called_once() without argument verification
- Asserting only some mock calls
Every assertion must be Level 4+ on the Assertion Strength Ladder.
Replacing a Level 1 assertion with a Level 2 assertion is NOT a fix.
2.1 Assertion Quality Gate (ALL modes)
- Read
patterns/assertion-quality-standard.md- the Full Assertion Principle and Assertion Strength Ladder - Classify each new/modified assertion on the Assertion Strength Ladder
- REJECT any assertion at Level 2 (bare substring) or Level 1 (length/existence)
- REJECT any fix that moves from one BANNED level to another (Pattern 10)
- Level 3 (structural containment) requires written justification in the code
- For each new assertion, name the specific production code mutation it catches
- If you cannot name a mutation, the assertion is too weak; strengthen it
2.2 Production Bug Protocol
PRODUCTION BUG DETECTED
Test: [test_function]
Expected behavior: [what test expects]
Actual behavior: [what code does]
This is not a test issue - production code has a bug.
Options:
A) Fix production bug (then test will pass)
B) Update test to match buggy behavior (not recommended)
C) Skip test, create issue for bug
Your choice: ___
Do NOT silently fix production bugs as "test fixes."
Phase 3: Batch Processing
FOR priority IN [critical, important, minor]:
FOR item IN work_items[priority]:
Execute Phase 2
IF stuck after 2 attempts:
Add to stuck_items[]
Continue to next item
Stuck Items Report
## Stuck Items
### [item.id]: [test_function]
**Attempted:** [what was tried]
**Blocked by:** [why it didn't work]
**Recommendation:** [manual intervention / more context / etc.]
Phase 3.5: Post-Fix Adversarial Review (MANDATORY)
Dispatch subagent (Task tool):
First, read these files to understand the quality requirements:
- Read patterns/assertion-quality-standard.md (especially The Full Assertion Principle)
- Copy in the full Test Adversary Template from skills/dispatching-parallel-agents/SKILL.md
ROLE: Test Adversary. Your job is to BREAK the new/modified test assertions.
## Context
- Modified test files: [list of files changed during fix phase]
- Git diff of changes: [paste or reference the diff]
- Production files under test: [paths]
## Mandatory Checks
1. IMMEDIATE REJECTION: Flag any assertion that is:
- assert "X" in result on deterministic output (BANNED)
- assert len(x) > 0 or assert x is not None (BANNED)
- A fix that replaced one BANNED pattern with another (Pattern 10)
2. For each new/modified assertion:
- Classify on Assertion Strength Ladder (must be Level 4+)
- Determine if function under test is deterministic
- If deterministic: only Level 5 (exact equality) is acceptable
- Construct a plausible broken implementation that still passes
- Verdict: KILLED or SURVIVED
3. Overall verdict:
- Any SURVIVED or BANNED assertion: FAIL (list required re-fixes)
- All KILLED + Level 4+: PASS
Return: Per-assertion verdicts and overall PASS/FAIL.
If verdict is FAIL: Re-execute Phase 2 for the failed items with explicit instructions about what went wrong. Do NOT skip re-review.
Phase 4: Final Verification
pytest -v # or appropriate test command
Summary Report
## Fix Tests Summary
### Input Mode
[audit_report / general_instructions / run_and_fix]
### Metrics
| Metric | Value |
|--------|-------|
| Total items | N |
| Fixed | X |
| Stuck | Y |
| Production bugs | Z |
### Fixes Applied
| Test | File | Issue | Fix | Commit |
|------|------|-------|-----|--------|
| test_foo | test_auth.py | Pattern 2 | Strengthened to full object match | abc123 |
### Test Suite Status
- Before: X passing, Y failing
- After: X passing, Y failing
### Stuck Items (if any)
[List with recommendations]
### Production Bugs Found (if any)
[List with recommended actions]
Re-audit Option (if from audit_report)
Fixes complete. Re-run audit-green-mirage to verify no new mirages?
A) Yes, audit fixed files
B) No, satisfied with fixes
Special Cases
Flaky tests: Identify non-determinism source (time, random, ordering, external state). Mock or control it. Use deterministic waits, not sleep-and-hope.
Implementation-coupled tests: Identify the BEHAVIOR the test should verify. Rewrite to test through the public interface. Remove mocks of the unit under test's own internals; do not remove mocks of external services.
Missing tests entirely: Read production code. Identify key behaviors. Write tests following existing test file patterns in the codebase. Ensure tests would catch real failures.
Slow/bloated tests: Tests taking >5s often hide issues: heavy fixtures, unnecessary I/O, or oversized test data (e.g., 1024x1024 matrix where 4x4 suffices). Separate slow tests with marks (@pytest.mark.slow, @pytest.mark.integration, etc.). Shrink test inputs to the minimum that exercises the behavior. Move real I/O to integration tier. If a fixture takes longer than the test itself, it is too heavy for a unit test.
Over-Engineering
- Creating elaborate test infrastructure for simple fixes
- Adding abstraction layers "for future flexibility"
- Refactoring unrelated code while fixing tests
Under-Testing
- Weakening assertions to make tests pass
- Removing tests instead of fixing them
- Marking tests as skip without fixing
Scope Creep
- Fixing production bugs without flagging them
- Refactoring production code to make tests easier
- Adding features while fixing tests
Blind Fixes
- Applying suggested fixes without reading context
- Copy-pasting fixes without understanding them
- Not verifying fixes actually catch failures
Self-Check
- All work items processed or explicitly marked stuck
- Each fix verified to pass
- Each fix verified to catch the failure it should catch
- Each fix verified to be Level 4+ on the Assertion Strength Ladder (
patterns/assertion-quality-standard.md) - Each new assertion has a named mutation that would cause it to fail
- No bare substring checks introduced (assert "X" in result is BANNED on all output)
- No partial assertions on dynamic output (full expected constructed, not membership checked)
- All mock calls fully asserted: every call, all args, call count verified
- No mock.ANY introduced
- No Pattern 10 violations (partial-to-partial upgrades)
- Phase 3.5 adversarial review completed with PASS verdict
- Full test suite ran at end
- Production bugs flagged, not silently fixed
- Commits follow agreed strategy
- Summary report provided
Fix it. Prove it works. Move on. No over-engineering. No under-testing.