1Cademy - Dynamic Adjustment of the Knowledge Distillation Coefficient (λ)

Learn Before

Combined Training Objective for Knowledge Distillation

Activity (Process)

Dynamic Adjustment of the Knowledge Distillation Coefficient (λ)

In a combined knowledge distillation objective, which can be used during either pre-training or fine-tuning, the interpolation coefficient λ sets how strongly the teacher model influences training, and it can be adjusted dynamically rather than held fixed. A common strategy is to gradually decrease λ as the student model's performance improves. Doing so shifts the training focus away from mimicking the teacher model and towards learning directly from the ground-truth data via the standard language modeling loss.
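
The strategy described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes the combined objective has the form (1 - λ) · L_LM + λ · L_KD, so that a larger λ means more teacher influence, and it assumes that "performance" is measured by a validation loss. The function names, the bounds `lam_max` and `lam_min`, and the linear mapping from performance to λ are all hypothetical choices made for this example.

```python
def combined_loss(lm_loss: float, kd_loss: float, lam: float) -> float:
    """Interpolate between the language modeling loss and the distillation loss.

    Assumed form: (1 - lam) * lm_loss + lam * kd_loss, so lam = 1 means
    "only mimic the teacher" and lam = 0 means "only fit the ground truth".
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * lm_loss + lam * kd_loss


def performance_lambda(
    val_loss: float,
    start_loss: float,
    target_loss: float,
    lam_max: float = 0.9,
    lam_min: float = 0.1,
) -> float:
    """Lower lambda as the student's validation loss approaches a target.

    start_loss  : validation loss at the beginning of training (hypothetical)
    target_loss : loss at which the student is considered good enough (hypothetical)
    """
    if start_loss <= target_loss:
        raise ValueError("start_loss must be greater than target_loss")
    # Fraction of the gap between start and target that the student has closed.
    closed = (start_loss - val_loss) / (start_loss - target_loss)
    # Clip so lambda stays within [lam_min, lam_max] even if the loss overshoots.
    closed = min(max(closed, 0.0), 1.0)
    return lam_max - (lam_max - lam_min) * closed


if __name__ == "__main__":
    # Illustrative validation losses only; not measurements from a real run.
    for val_loss in (4.0, 3.5, 3.0, 2.5, 2.0):
        lam = performance_lambda(val_loss, start_loss=4.0, target_loss=2.0)
        print(f"val_loss={val_loss:.1f}  lambda={lam:.2f}")
```

In a training loop, `performance_lambda` would be re-evaluated after each validation pass and its result passed to `combined_loss` for the following steps. A schedule driven by the step count instead of validation loss is a simpler alternative, though it only approximates "as the student improves".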

Updated 2026-05-01
