Fast and Effective On-policy Distillation from Reasoning Prefixes
#on‑policy distillation #student policy sampling #reasoning prefixes #token‑level supervision #training cost #generalisation #arXiv preprint
📌 Key Takeaways
- On‑policy distillation (OPD) samples student trajectories and uses a teacher for token‑level supervision.
- OPD generalises better than off‑policy distillation because the student learns on its own output distribution, and better than RL because its dense token‑level signal does not depend only on sparse, verifiable terminal rewards.
- The sampling requirement of OPD incurs significant training cost, particularly for lengthy responses.
- The paper proposes a fast and effective approach that utilizes reasoning prefixes to mitigate this cost.
- Initial analysis in the study highlights the computational challenges and suggests potential efficiencies.
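The token‑level supervision mentioned above can be sketched as a per‑token divergence between student and teacher next‑token distributions along a student‑sampled trajectory. The following is a minimal NumPy sketch of that signal (a reverse‑KL variant is a common choice; the function name and toy shapes are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def token_level_reverse_kl(student_logits, teacher_logits):
    """Mean per-token KL(student || teacher) over a student-sampled
    trajectory -- the dense supervision used in on-policy distillation.
    Toy sketch: real systems work with model logits, not random arrays."""
    p_s = softmax(student_logits)   # (T, V) student next-token distributions
    p_t = softmax(teacher_logits)   # (T, V) teacher next-token distributions
    per_token = (p_s * (np.log(p_s) - np.log(p_t))).sum(axis=-1)  # (T,)
    return per_token.mean()

# Toy example: 5-token trajectory over a vocabulary of 8
rng = np.random.default_rng(0)
s = rng.normal(size=(5, 8))
print(token_level_reverse_kl(s, s))  # identical models -> 0.0
```

Because every sampled token receives a target distribution from the teacher, the gradient signal is far denser than a single terminal reward.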
📖 Full Retelling
🏷️ Themes
Machine Learning, Natural Language Processing, Reinforcement Learning, Model Distillation
Deep Analysis
Why It Matters
On-policy distillation improves model generalization by supervising token-level trajectories, but it is expensive; this work proposes a faster method that retains benefits, making advanced distillation more practical for large models.
Context & Background
- Distillation transfers knowledge from teacher to student models
- On-policy distillation samples student trajectories during training, unlike off-policy which uses pre-recorded data
- Long response generation increases sampling cost, limiting scalability
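The scalability point above comes down to simple arithmetic: per‑step sampling cost scales with the number of tokens the student must generate. A hypothetical cost model (illustrative numbers, not from the paper) shows how conditioning rollouts on a cached reasoning prefix shrinks that budget:

```python
def rollout_tokens(num_rollouts, response_len, prefix_len=0):
    """Tokens the student must generate in one training step.
    A cached reasoning prefix of length `prefix_len` is fed as context,
    so those tokens are not sampled on the fly (toy cost model)."""
    return num_rollouts * max(response_len - prefix_len, 0)

full = rollout_tokens(num_rollouts=8, response_len=4096)
short = rollout_tokens(num_rollouts=8, response_len=4096, prefix_len=3072)
print(full, short)  # 32768 8192 -- a 4x reduction in sampled tokens
```

The longer the reasoning traces, the larger the share of each rollout a reusable prefix can cover, which is why the savings matter most for lengthy responses.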
What Happens Next
The proposed technique will be tested on larger benchmarks, integrated into training pipelines, and may inspire further cost‑saving distillation strategies.
Frequently Asked Questions
What is on‑policy distillation?
It is a training method in which the student model generates sequences during training and the teacher supervises each token of those sequences.
How does the proposed approach reduce training cost?
By sampling only the necessary continuations after reasoning prefixes and applying token‑level supervision there, it cuts the number of tokens generated on the fly per trajectory.
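The prefix‑based step described above can be sketched as: build the context from the prompt plus a reasoning prefix, let the student sample only the continuation, and collect teacher targets for those continuation tokens. Everything here (function names, the toy stand‑in models) is a hypothetical illustration of the idea, not the paper's code:

```python
def distill_step(prompt, reasoning_prefix, student_generate, teacher_logprobs):
    """One hypothetical prefix-based distillation step: the student only
    samples the continuation after the prefix, and token-level teacher
    supervision covers just those student-generated tokens."""
    context = prompt + reasoning_prefix
    continuation = student_generate(context)           # shortened on-policy rollout
    targets = teacher_logprobs(context, continuation)  # one target per token
    return continuation, targets

# Toy stand-ins for the real student and teacher models
student_generate = lambda ctx: " therefore x = 4."
teacher_logprobs = lambda ctx, cont: [-0.1] * len(cont.split())
cont, tgt = distill_step("Solve 2x = 8.", " Divide both sides by 2;",
                         student_generate, teacher_logprobs)
print(len(tgt))  # 4 supervised continuation tokens
```

The key design choice is that supervision is restricted to tokens the student actually sampled, preserving the on‑policy property while the prefix shortens each rollout.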