Dynamic Adjustment of the Knowledge Distillation Coefficient (λ)
In a combined knowledge distillation objective, which can be applied during either pre-training or fine-tuning, the interpolation coefficient λ sets how strongly the teacher model influences the student, and its value can be adjusted dynamically over the course of training. A common strategy is to gradually decrease λ as the student model's performance improves. This shifts the training focus away from mimicking the teacher model and toward learning directly from the ground-truth data via the standard language modeling loss.
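The decay strategy described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: it assumes the combined objective is a convex interpolation, (1 - λ) · LM loss + λ · distillation loss, and the function names (`linear_lambda`, `decay_on_improvement`, `combined_loss`) and default values are hypothetical choices made for the example.

```python
def linear_lambda(step, total_steps, lam_start=0.9, lam_end=0.1):
    """Step-based schedule: decay lambda linearly from lam_start to lam_end.

    Training progress is used here as a simple proxy for student improvement.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return lam_start + (lam_end - lam_start) * progress


def decay_on_improvement(lam, prev_val_loss, val_loss, factor=0.9, lam_min=0.0):
    """Performance-based schedule: shrink lambda only when validation loss improves."""
    if val_loss < prev_val_loss:
        return max(lam * factor, lam_min)
    return lam


def combined_loss(lm_loss, kd_loss, lam):
    """Assumed interpolation form: (1 - lam) * lm_loss + lam * kd_loss."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * lm_loss + lam * kd_loss


# Usage: the teacher term dominates early and fades as training proceeds.
for step in (0, 500, 1000):
    lam = linear_lambda(step, total_steps=1000)
    loss = combined_loss(lm_loss=2.0, kd_loss=4.0, lam=lam)
    print(f"step={step} lambda={lam:.2f} loss={loss:.2f}")
```

With the step-based schedule, λ depends only on training progress; the performance-based variant ties the decay directly to validation results, which is closer to the "as performance improves" wording but needs a held-out set to evaluate on. If λ is held high for the whole run, the student keeps being pulled toward the teacher's predictions rather than the ground-truth data.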
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Related
Combined Training Objective Formula for Knowledge Distillation
Optimizing Student Model Training
When training a smaller 'student' model using a combined objective that learns from both a larger 'teacher' model and the ground-truth data, what is the primary role of the component that learns directly from the ground-truth data?
A student model is being trained using a combined objective that incorporates learning from both a larger 'teacher' model and the ground-truth data. Match each learning source with its primary contribution to the student model's training process.
Learn After
Optimizing a Student Model's Training
An engineer is training a student language model using a combined objective that balances learning from a teacher model's predictions (distillation loss) and learning from the ground-truth data (standard loss). The interpolation coefficient, λ, weighs the teacher's influence. The engineer observes that the student model quickly learns to mimic the teacher's output, but its performance on a validation set eventually plateaus and fails to surpass the teacher's performance, even though the student has the capacity to do better. What is the most probable cause of this issue related to the adjustment of λ?
A student model is being trained using a combined objective that includes a term for learning from a teacher model, weighted by a coefficient λ. Arrange the following training stages in the order that corresponds to a typical and effective dynamic adjustment schedule for λ, from the highest value of λ to the lowest.