name: near-policy-distillation
description: Methodology for accelerating on-policy distillation via asynchronous generation and evaluation. Decouples generation from evaluation using a near-policy buffer, achieving 2-4x throughput gains without quality degradation.
category: deep-learning
tags: [LLM, distillation, RL, training-efficiency, async-compute]
trigger: distillation, on-policy distillation, async generation, NPD, teacher-student, throughput optimization
Near-Policy Distillation (NPD) Methodology
Overview
Near-Policy Distillation accelerates on-policy distillation by decoupling the expensive generation phase from the evaluation/update phase using an asynchronous buffer of near-policy samples.
Core Technique
- Asynchronous Generation Buffer: A separate worker continuously generates samples from a slightly stale (near-policy) version of the model
- Decoupled Evaluation: The main training loop consumes buffered samples for distillation updates without waiting for generation
- Near-Policy Guarantee: The buffer is managed to ensure samples are close enough to the current policy (bounded KL divergence) to maintain on-policy quality
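One way to check the near-policy condition is to estimate drift from the log-probabilities already stored with each sample. The sketch below assumes the buffer holds sequence-level log-probabilities recorded by the stale policy and that the current policy can re-score the same responses. The names `estimate_drift`, `is_near_policy`, and `kl_threshold` are illustrative and do not come from the method description.

```python
from statistics import fmean


def estimate_drift(stale_logprobs, current_logprobs):
    """Monte Carlo estimate of KL(stale || current).

    Valid because the responses were sampled from the stale policy:
    KL(q || p) = E_{x ~ q}[log q(x) - log p(x)].
    """
    if len(stale_logprobs) != len(current_logprobs):
        raise ValueError("log-prob lists must have the same length")
    if not stale_logprobs:
        raise ValueError("need at least one sample")
    return fmean([s - c for s, c in zip(stale_logprobs, current_logprobs)])


def is_near_policy(stale_logprobs, current_logprobs, kl_threshold):
    """True when the estimated drift is within the allowed bound (hypothetical rule)."""
    return estimate_drift(stale_logprobs, current_logprobs) <= kl_threshold
```

With few samples the estimate is noisy and can even be negative, so a batch-level average is likely a safer trigger than a per-sample check.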
Key Benefits
- 2-4x throughput improvement over synchronous on-policy distillation
- No quality degradation: distillation fidelity is maintained by keeping buffer staleness KL-bounded
- Better GPU utilization: generation and evaluation can run on separate devices or be time-sliced on one device
Implementation Steps
- Spawn async generation worker with delayed policy snapshot
- Maintain circular buffer of (prompt, response, logprob) tuples
- Main loop consumes buffer entries, applies KL penalty if policy drift exceeds threshold
- Refresh generation worker's policy snapshot periodically
- Tune buffer size and refresh frequency based on GPU memory and staleness tolerance
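The steps above can be sketched end to end with a toy policy. This is a minimal illustration under stated assumptions, not the reference implementation: `ToyPolicy`, `NearPolicyBuffer`, `generation_worker`, and `train` are hypothetical names, the toy log-probability rule stands in for a real model, and the per-sample log-probability gap is used as the drift signal.

```python
import threading
from collections import deque


class ToyPolicy:
    """Hypothetical stand-in for a model; it only tracks a version number."""

    def __init__(self, version=0):
        self.version = version

    def generate(self, prompt):
        # Returns (response, log-probability of the response under this policy).
        return f"{prompt}@v{self.version}", -1.0

    def logprob(self, response):
        # Toy rule: log-probability drops by 0.1 per version of lag.
        sample_version = int(response.rsplit("@v", 1)[1])
        return -1.0 - 0.1 * abs(self.version - sample_version)


class NearPolicyBuffer:
    """Circular buffer of (prompt, response, logprob) tuples."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)  # a full buffer drops its oldest entry
        self._cond = threading.Condition()

    def put(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()

    def get(self, timeout=1.0):
        with self._cond:
            if not self._cond.wait_for(lambda: len(self._items) > 0, timeout):
                return None  # generation fell behind
            return self._items.popleft()


def generation_worker(buffer, shared, prompts, stop):
    """Generate continuously from the latest published policy snapshot."""
    i = 0
    while not stop.is_set():
        snapshot = shared["snapshot"]  # possibly stale copy of the policy
        prompt = prompts[i % len(prompts)]
        response, logprob = snapshot.generate(prompt)
        buffer.put((prompt, response, logprob))
        i += 1
        stop.wait(0.001)  # pacing so the toy worker does not spin


def train(steps=20, capacity=8, refresh_every=5, kl_threshold=0.25, kl_coef=1.0):
    policy = ToyPolicy()
    shared = {"snapshot": ToyPolicy(policy.version)}
    buffer = NearPolicyBuffer(capacity)
    stop = threading.Event()
    worker = threading.Thread(
        target=generation_worker, args=(buffer, shared, ["p0", "p1"], stop), daemon=True
    )
    worker.start()
    log = []
    try:
        for step in range(1, steps + 1):
            item = buffer.get()
            if item is None:
                continue
            _prompt, response, stale_logprob = item
            drift = stale_logprob - policy.logprob(response)  # per-sample drift signal
            penalty = kl_coef * drift if drift > kl_threshold else 0.0
            log.append((step, drift, penalty))
            policy.version += 1  # placeholder for the real distillation update
            if step % refresh_every == 0:
                shared["snapshot"] = ToyPolicy(policy.version)  # refresh the worker
    finally:
        stop.set()
        worker.join()
    return log
```

Calling `train(steps=10)` returns one `(step, drift, penalty)` record per consumed sample. In a real setup the version bump would be replaced by an optimizer step on the distillation loss plus the penalty, and the snapshot refresh would copy model weights to the generation device.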
Pitfalls
- Buffer staleness must be monitored: excessive KL divergence between buffered samples and the current policy degrades distillation quality
- Memory management: large buffers can exceed GPU/CPU memory on long-context models
- Requires careful synchronization to avoid race conditions between buffer writes and reads
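To make the memory pitfall concrete, a rough size estimate can be computed before choosing a buffer capacity. The sketch assumes each entry stores token ids for the prompt and response plus one log-probability per response token. The 4-byte defaults and the function name `buffer_bytes` are assumptions for illustration only.

```python
def buffer_bytes(capacity, prompt_tokens, response_tokens,
                 bytes_per_token_id=4, bytes_per_logprob=4):
    """Rough upper bound on buffer size when every entry is at maximum length."""
    ids = (prompt_tokens + response_tokens) * bytes_per_token_id
    logprobs = response_tokens * bytes_per_logprob  # one value per response token
    return capacity * (ids + logprobs)


# 4096 entries, 2048-token prompts, 8192-token responses
size_mib = buffer_bytes(4096, 2048, 8192) / 2**20
print(f"{size_mib:.0f} MiB")  # prints "288 MiB"
```

The estimate ignores container overhead, and it grows sharply if entries hold more than a single log-probability per token, so it is best treated as a lower bound for planning.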
Activation Keywords
distillation, on-policy distillation, async generation, NPD, teacher-student, throughput optimization, generation buffer