name: ff-debug
description: "Bug fixing and debugging for ANY error, crash, loss divergence, gradient explosion, distributed hang, NaN, or unexpected behavior. Covers quick fixes and full protocol with 5-phase investigation. Trigger: 'fix bug', 'fix error', 'broken', 'crash', 'doesn't work', 'fails with', 'loss NaN', 'training hangs', 'OOM'."
Debug Workflow
Related Topics (read for numerical / consistency issues)
- NaN, loss divergence, wrong gradients -> `topics/train_inference_consistency.md`
- Dtype mismatch, overflow, precision -> `topics/dtype_precision.md`
- Frozen/flat loss or KL ≈ 0 -> `topics/autocast_param_swap.md` (#20a)
Two Pathways
Quick Path (obvious root cause)
Use when: Error message clearly points to the issue (typo, missing import, wrong type).
- Reproduce the error
- Check `.agents/knowledge/constraints.md` for relevant constraints
- Write targeted fix
- Verify with test
- Run `/ff-review`, commit
If not resolved in 15 min -> switch to Full Protocol.
Full Protocol (complex issues)
Use when:
- Distributed training bugs (deadlocks, rank mismatches)
- Numerical issues (NaN, loss divergence, wrong gradients)
- Silent failures (training runs but produces garbage)
- Multiple failed fix attempts
Full Protocol — Five Phases
Phase 1: Root Cause Investigation
- Read complete error messages: full stack traces matter, so don't skim
- Consult constraints: check `.agents/knowledge/constraints.md`
- Reproduce consistently: isolate the exact trigger condition
- Trace execution path: follow through the 6-stage pipeline
- Check recent changes: `git log --oneline -10` shows what changed recently
Distributed-Specific Checklist
- Does the error appear on all ranks or just one?
- Is `accelerator.wait_for_everyone()` missing before the failure point?
- Are frozen components synchronized across ranks? (Constraint #19)
- Is ZeRO-3 being used? (Constraint #10 — unsupported)
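To answer the first question in this checklist, it helps if every log line carries the rank that produced it. The following is a minimal sketch, assuming the launcher exports a `RANK` environment variable (as `torchrun` does). The `get_rank_logger` helper is hypothetical and not part of the project.

```python
import logging
import os


def get_rank_logger(name: str = "debug") -> logging.Logger:
    """Return a logger whose messages are prefixed with the process rank.

    Assumption: the launcher sets RANK; a single-process run falls back to 0.
    """
    rank = int(os.environ.get("RANK", "0"))
    logger = logging.getLogger(f"{name}.rank{rank}")
    if not logger.handlers:  # avoid stacking duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter(f"[rank {rank}] %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep the root logger from printing a second copy
    return logger
```

Calling `get_rank_logger().debug("entering stage 2")` on each process makes it visible which ranks reached a given point before a hang or crash.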
Phase 2: Pattern Analysis
- Find working examples: compare with a similar model/algorithm that works
- Diff analysis: what differs between the working and broken paths? Compare completely, line by line, rather than skimming. Include config YAML and environment variables.
- Isolate variables: change one thing at a time
- Check dependencies: different diffusers version? Different PyTorch version?
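A complete comparison of two configurations is easier on flattened key paths than on nested YAML. The sketch below assumes both configs are already loaded into plain dictionaries. The `flatten` and `diff_configs` helpers and the example keys are hypothetical.

```python
from typing import Any, Dict, Tuple

_MISSING = "<missing>"  # placeholder for a key present on only one side


def flatten(cfg: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Turn nested dicts into a flat {"a.b.c": value} mapping."""
    flat: Dict[str, Any] = {}
    for key, value in cfg.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict) and value:
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_configs(
    working: Dict[str, Any], broken: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return every key path whose value differs, as (working, broken) pairs."""
    a, b = flatten(working), flatten(broken)
    return {
        key: (a.get(key, _MISSING), b.get(key, _MISSING))
        for key in sorted(set(a) | set(b))
        if a.get(key, _MISSING) != b.get(key, _MISSING)
    }


working = {"train": {"lr": 1e-4, "dtype": "bf16"}, "seed": 0}
broken = {"train": {"lr": 1e-4, "dtype": "fp16"}, "seed": 0,
          "env": {"NCCL_DEBUG": "INFO"}}
print(diff_configs(working, broken))
# {'env.NCCL_DEBUG': ('<missing>', 'INFO'), 'train.dtype': ('bf16', 'fp16')}
```

Environment variables can be included by passing, for example, a filtered copy of `os.environ` under an `env` key on both sides.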
Phase 3: Hypothesis Testing
- One hypothesis per iteration — Formulate a single falsifiable hypothesis
- Minimal test case — Reproduce with smallest possible config
- Low confidence (<80%)? — Add debug logging before applying fix
Red flags — STOP and restart from Phase 1:
- "Let me just try changing X and see what happens"
- "Quick fix for now, clean up later"
- "It probably works, let me move on"
Verification gate — before acting on a conclusion, check:
- Does the evidence actually support this cause, or just correlate?
- Could a different root cause produce the same symptoms?
- What observation would disprove this hypothesis? Have you looked for it?
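One way to make a hypothesis such as "the NaN first appears in stage X" falsifiable is to record a few values per stage and report the earliest stage that contains a non-finite number. The sketch below works on plain Python floats. The stage names and the `first_non_finite` helper are hypothetical, and in a real run the values would have to be extracted from tensors first.

```python
import math
from typing import Dict, Iterable, Optional


def first_non_finite(stats: Dict[str, Iterable[float]]) -> Optional[str]:
    """Return the first stage (in insertion order) holding a NaN or inf value.

    Returns None when every recorded value is finite, which would disprove
    the hypothesis that any of the recorded stages produces the bad value.
    """
    for stage, values in stats.items():
        if any(not math.isfinite(v) for v in values):
            return stage
    return None


# Hypothetical values recorded in pipeline order during one training step.
recorded = {
    "reward": [0.1, 0.2],
    "advantage": [0.0, float("nan")],
    "loss": [float("nan")],
}
print(first_non_finite(recorded))  # advantage
```

If the hypothesis named `loss` as the origin, this observation contradicts it: the value was already non-finite one stage earlier.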
Phase 4: Fix Implementation
- Write failing test first (if possible)
- Implement targeted fix — Only fix the bug, don't refactor
- Check cross-algorithm impact — Does this fix break GRPO? NFT? AWM?
- Check cross-model impact — Test with at least two model adapters
- Before committing: run the `/ff-review` skill.
Phase 5: Knowledge Capture
After fix is verified:
- Update `constraints.md` if a new constraint was discovered
- Add regression test if applicable
- Document the root cause in the commit message
- Follow fix archival process in `topics/fix_patterns.md`
Three-Strike Rule
If the same approach fails three times:
- HALT all fix attempts
- Question whether the underlying approach/architecture is wrong
- Step back and re-examine: are you solving the right problem?
- Report to user with analysis before continuing
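The rule can be made mechanical by counting failed attempts per approach. This is only an illustrative sketch: the `StrikeTracker` class is hypothetical, and the limit of three is taken from the rule above.

```python
from collections import Counter


class StrikeTracker:
    """Count failed fix attempts per approach and signal when to halt."""

    def __init__(self, limit: int = 3) -> None:
        self.limit = limit
        self._failures: Counter = Counter()

    def record_failure(self, approach: str) -> bool:
        """Record one failed attempt; return True once the limit is reached."""
        self._failures[approach] += 1
        return self._failures[approach] >= self.limit


tracker = StrikeTracker()
for _ in range(3):
    must_halt = tracker.record_failure("adjust loss scaling")
print(must_halt)  # True
```

Failures are keyed by approach, so trying a different approach starts a fresh count, while repeating the same one does not.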
Common Issue Categories
Training Loop Issues
- Stage ordering violated? (Constraint #6)
- Coupled/decoupled paradigm mismatch? (Constraint #7)
- Component not on correct device? (Constraint #8)
- Dataloader incorrectly prepared via accelerator? (Constraint #9)
Model Adapter Issues
- `load_pipeline()` returning wrong type? (Constraint #5)
- `target_module_map` mapping incorrect components?
- `_shared_fields` causing data corruption? (Constraint #14)
- Preprocessing modules not offloaded after Stage 1?
Reward Issues
- Pointwise/Groupwise confusion? (Constraint #13)
- Wrong reward shape returned?
- `required_fields` not set correctly?
- Device mismatch between reward model and generated samples?
Configuration Issues
- YAML key doesn't match Pydantic field name? (Constraint #17)
- Algorithm-specific args using wrong subclass? (Constraint #16)
- Registry key doesn't match? (Constraint #1)
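A mismatch between a YAML key and a field name can be caught before training starts by comparing the loaded keys against the schema. The sketch below uses a standard-library dataclass as a stand-in for the project's Pydantic models so that it has no third-party dependencies. `TrainingArgs`, its fields, and `unknown_keys` are hypothetical.

```python
import difflib
from dataclasses import dataclass, fields
from typing import Any, Dict, List


@dataclass
class TrainingArgs:  # hypothetical schema, standing in for a Pydantic model
    learning_rate: float = 1e-4
    batch_size: int = 8
    mixed_precision: str = "bf16"


def unknown_keys(raw: Dict[str, Any], schema: type) -> Dict[str, List[str]]:
    """Map each key that is not a schema field to its closest field name, if any."""
    known = [f.name for f in fields(schema)]
    return {
        key: difflib.get_close_matches(key, known, n=1)
        for key in raw
        if key not in known
    }


loaded = {"learning_rate": 1e-4, "batchsize": 16}  # as parsed from YAML
print(unknown_keys(loaded, TrainingArgs))  # {'batchsize': ['batch_size']}
```

An empty result means every key maps to a field. A non-empty one points at the likely typo, which matters most when the schema silently ignores unknown keys and falls back to defaults.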
Distributed Issues
- Missing synchronization barrier? (Constraint #18)
- FSDP frozen components uninitialized on Rank > 0? (Constraint #19)
- Mixed precision casting order incorrect? (Constraint #20); see also `topics/dtype_precision.md` for the precision diagnosis checklist
- Using ZeRO-3? (Constraint #10, not supported)