ff-debug

name: ff-debug
description: "Bug fixing and debugging for ANY error, crash, loss divergence, gradient explosion, distributed hang, NaN, or unexpected behavior. Covers quick fixes and full protocol with 5-phase investigation. Trigger: 'fix bug', 'fix error', 'broken', 'crash', 'doesn't work', 'fails with', 'loss NaN', 'training hangs', 'OOM'."

Debug Workflow

Two Pathways

Quick Path (obvious root cause)

Use when: Error message clearly points to the issue (typo, missing import, wrong type).

Reproduce the error
Check .agents/knowledge/constraints.md for relevant constraints
Write targeted fix
Verify with test
Run /ff-review, commit

If the issue is not resolved within 15 minutes, switch to the Full Protocol.

Full Protocol (complex issues)

Use when:

Distributed training bugs (deadlocks, rank mismatches)
Numerical issues (NaN, loss divergence, wrong gradients)
Silent failures (training runs but produces garbage)
Multiple failed fix attempts

Full Protocol — Five Phases

Phase 1: Root Cause Investigation

Read complete error messages — Full stack traces matter; do not skim
Consult constraints — Check .agents/knowledge/constraints.md
Reproduce consistently — Isolate the exact trigger condition
Trace execution path — Follow through the 6-stage pipeline
Check recent changes — git log --oneline -10 — what changed recently?

Distributed-Specific Checklist

Does the error appear on all ranks or just one?
Is accelerator.wait_for_everyone() missing before the failure point?
Are frozen components synchronized across ranks? (Constraint #19)
Is ZeRO-3 being used? (Constraint #10 — unsupported)
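
One way to check whether frozen components agree across ranks is to compare a digest of their weights per rank. The sketch below is illustrative only: `checksum_bytes` and `find_divergent_ranks` are hypothetical helpers, not part of this codebase. It assumes each rank's digest has already been collected into one dictionary, for example with `torch.distributed.all_gather_object`.

```python
import hashlib
from collections import Counter
from typing import Dict, Iterable, List


def checksum_bytes(chunks: Iterable[bytes]) -> str:
    """Hash a sequence of byte chunks (e.g. serialized frozen weights)."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def find_divergent_ranks(checksums: Dict[int, str]) -> List[int]:
    """Return the ranks whose checksum differs from the most common one."""
    if not checksums:
        return []
    majority, _ = Counter(checksums.values()).most_common(1)[0]
    return sorted(rank for rank, value in checksums.items() if value != majority)


# Hypothetical data: rank 2 holds different (e.g. uninitialized) weights.
per_rank = {
    0: checksum_bytes([b"weights-v1"]),
    1: checksum_bytes([b"weights-v1"]),
    2: checksum_bytes([b"\x00" * 10]),
    3: checksum_bytes([b"weights-v1"]),
}
divergent = find_divergent_ranks(per_rank)
```

An empty result suggests the frozen components match on every rank. A non-empty result names the ranks to inspect first.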

Phase 2: Pattern Analysis

Find working examples — Compare with a similar model/algorithm that works
Diff analysis — What's different between working and broken paths? Compare completely: diff line by line rather than skimming. Include the config YAML and environment variables.
Isolate variables — Change one thing at a time
Check dependencies — Different diffusers version? Different PyTorch version?
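
The diff step above can be made mechanical by flattening the working and broken setups into dotted keys and listing every key whose value differs. This is a minimal sketch, assuming each setup has been loaded into a nested dictionary. The `flatten` and `diff_snapshots` helpers and the example keys are hypothetical.

```python
from typing import Any, Dict, Iterator, Tuple

MISSING = "<missing>"


def flatten(tree: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_path, value) pairs for every leaf of a nested dict."""
    for key, value in tree.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def diff_snapshots(working: Dict[str, Any], broken: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing dotted path to its (working, broken) values."""
    a, b = dict(flatten(working)), dict(flatten(broken))
    return {
        key: (a.get(key, MISSING), b.get(key, MISSING))
        for key in sorted(a.keys() | b.keys())
        if a.get(key, MISSING) != b.get(key, MISSING)
    }


# Hypothetical snapshots of a working and a broken run.
working = {"config": {"lr": 1e-4, "mixed_precision": "bf16"}, "env": {"NCCL_DEBUG": "INFO"}}
broken = {"config": {"lr": 1e-4, "mixed_precision": "fp16"}, "env": {}}
differences = diff_snapshots(working, broken)
```

Each snapshot can also carry a `versions` entry (for instance, filled in with `importlib.metadata.version`), so that dependency drift appears in the same diff as config and environment changes.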

Phase 3: Hypothesis Testing

One hypothesis per iteration — Formulate a single falsifiable hypothesis
Minimal test case — Reproduce with smallest possible config
Low confidence (<80%)? — Add debug logging before applying fix
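
For numerical issues, debug logging is most useful when it records which quantity goes non-finite first. The sketch below uses plain Python floats so it runs without a framework. With real tensors, the same check would typically rely on `torch.isfinite`. The `first_nonfinite` helper and the logger name are hypothetical.

```python
import logging
import math
from typing import Dict, Iterable, Optional, Tuple

logger = logging.getLogger("ff_debug")  # hypothetical logger name


def first_nonfinite(
    named_values: Dict[str, Iterable[float]],
) -> Optional[Tuple[str, int, float]]:
    """Return (name, index, value) for the first NaN/inf found, else None."""
    for name, values in named_values.items():
        for index, value in enumerate(values):
            if not math.isfinite(value):
                logger.debug("non-finite value in %s[%d]: %r", name, index, value)
                return name, index, value
    return None


# Hypothetical per-step scalars captured during a minimal reproduction.
hit = first_nonfinite({
    "loss": [0.52, 0.49, 0.47],
    "grad_norm": [1.3, float("inf"), 2.1],
})
```

Quantities are scanned in insertion order, so listing them in the order they are computed points at the earliest suspect.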

Red flags — STOP and restart from Phase 1:

"Let me just try changing X and see what happens"
"Quick fix for now, clean up later"
"It probably works, let me move on"

Verification gate — before acting on a conclusion, check:

Does the evidence actually support this cause, or just correlate?
Could a different root cause produce the same symptoms?
What observation would disprove this hypothesis? Have you looked for it?

Phase 4: Fix Implementation

Write failing test first (if possible)
Implement targeted fix — Only fix the bug, don't refactor
Check cross-algorithm impact — Does this fix break GRPO? NFT? AWM?
Check cross-model impact — Test with at least two model adapters
Before committing: run /ff-review skill.

Phase 5: Knowledge Capture

After fix is verified:

Update constraints.md if a new constraint was discovered
Add regression test if applicable
Document the root cause in the commit message
Follow fix archival process in topics/fix_patterns.md

Three-Strike Rule

If the same approach fails three times:

HALT all fix attempts
Question whether the underlying approach/architecture is wrong
Step back and re-examine: are you solving the right problem?
Report to user with analysis before continuing
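
The rule can be enforced mechanically by counting failures per approach and raising once the limit is reached. This is an illustrative sketch only. `StrikeTracker` and `ThreeStrikeHalt` are hypothetical names, not part of the workflow's tooling.

```python
from collections import defaultdict
from typing import DefaultDict


class ThreeStrikeHalt(RuntimeError):
    """Raised when one approach has failed the allowed number of times."""


class StrikeTracker:
    def __init__(self, limit: int = 3) -> None:
        self.limit = limit
        self._failures: DefaultDict[str, int] = defaultdict(int)

    def record_failure(self, approach: str) -> int:
        """Count a failed attempt; raise once the limit is reached."""
        self._failures[approach] += 1
        count = self._failures[approach]
        if count >= self.limit:
            raise ThreeStrikeHalt(
                f"'{approach}' failed {count} times: halt and report to the user"
            )
        return count


tracker = StrikeTracker()
first = tracker.record_failure("lower the learning rate")
second = tracker.record_failure("lower the learning rate")
try:
    tracker.record_failure("lower the learning rate")
    halted = False
except ThreeStrikeHalt:
    halted = True
```

Failures are keyed by approach, so trying a different approach starts a fresh count. Only repeating the same one triggers the halt.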

Common Issue Categories

Training Loop Issues

Stage ordering violated? (Constraint #6)
Coupled/decoupled paradigm mismatch? (Constraint #7)
Component not on correct device? (Constraint #8)
Dataloader incorrectly prepared via accelerator? (Constraint #9)

Model Adapter Issues

load_pipeline() returning wrong type? (Constraint #5)
target_module_map mapping incorrect components?
_shared_fields causing data corruption? (Constraint #14)
Preprocessing modules not offloaded after Stage 1?

Reward Issues

Pointwise/Groupwise confusion? (Constraint #13)
Wrong reward shape returned?
required_fields not set correctly?
Device mismatch between reward model and generated samples?

Configuration Issues

YAML key doesn't match Pydantic field name? (Constraint #17)
Algorithm-specific args using wrong subclass? (Constraint #16)
Registry key doesn't match? (Constraint #1)
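
A mismatched YAML key can be caught by comparing the loaded keys against the declared field names and suggesting the closest match. The sketch below uses a standard-library dataclass as a stand-in so it runs without Pydantic. With a Pydantic v2 model the names would come from `model_fields` instead. `TrainingArgs`, its fields, and `unknown_keys` are hypothetical.

```python
import dataclasses
import difflib
from typing import Any, Dict, List


@dataclasses.dataclass
class TrainingArgs:  # hypothetical schema, for illustration only
    learning_rate: float = 1e-4
    num_inference_steps: int = 20


def unknown_keys(raw: Dict[str, Any], schema: type) -> Dict[str, List[str]]:
    """Map each key not declared on the schema to its closest field name."""
    valid = [field.name for field in dataclasses.fields(schema)]
    return {
        key: difflib.get_close_matches(key, valid, n=1)
        for key in raw
        if key not in valid
    }


# Hypothetical YAML content after loading: note the missing trailing "s".
loaded = {"learning_rate": 1e-4, "num_inference_step": 30}
problems = unknown_keys(loaded, TrainingArgs)
```

An empty suggestion list means no declared field is similar, which usually indicates the key belongs to a different args subclass rather than being a typo.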

Distributed Issues

Missing synchronization barrier? (Constraint #18)
FSDP frozen components uninitialized on Rank > 0? (Constraint #19)
Mixed precision casting order incorrect? (Constraint #20) — see also topics/dtype_precision.md for the precision diagnosis checklist
Using ZeRO-3? (Constraint #10 — not supported)