name: oom-recovery-playbook description: Out-of-memory and resource pressure recovery without changing the user's training objective. always: false
OOM recovery playbook
When a run fails with CUDA OOM, host OOM, or timeout:
Do first (preserves the objective)
- Reduce per-step memory — lower
per_device_train_batch_size(or equivalent) and raisegradient_accumulation_stepsso effective batch size stays the same when possible. - Checkpointing — enable gradient checkpointing / activation checkpointing if the framework supports it.
- Precision — move to bf16/fp16 only if it does not change the user’s stated precision requirement.
- Hardware — move to a larger GPU or more RAM before changing the algorithm.
Do not do without explicit user consent
- Switching full fine-tuning → LoRA/QLoRA (changes what is trained).
- Silently lowering
max_length/ context window (changes what data the model sees). - Swapping datasets or model checkpoints “because they fit”.
- Disabling logging or monitoring to hide failures.
After stabilization
Re-run a small pilot, confirm loss/metrics move as expected, then restore full scale.