Learn Before
Adjusting the Distillation Loss Coefficient
In the combined training objective for large language models, given a dataset of input-output pairs (denoted as D), the interpolation coefficient (denoted as λ) determines how much influence a smaller model has on the training process. The coefficient can be adjusted during either the pre-training or fine-tuning phase. A common strategy is to gradually decrease λ, which reduces the reliance on the small model's knowledge distillation loss and places more emphasis on the original language modeling loss as the larger model becomes more capable.
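As a rough illustration, here is a minimal sketch of how such a schedule might be wired up, assuming the combined objective has the form Loss = λ · Loss_KD + (1 − λ) · Loss_LM and using PyTorch. The function names (combined_loss, linear_lambda_schedule) and the linear decay are illustrative choices, not specifics from the course.

```python
import torch
import torch.nn.functional as F

def combined_loss(model_logits, small_model_logits, target_ids, lam, temperature=1.0):
    """Interpolate between the original LM loss and the distillation loss.

    lam close to 1 -> lean heavily on the small model's soft targets;
    lam close to 0 -> lean mostly on the ground-truth next-token loss.
    """
    vocab = model_logits.size(-1)

    # Original language-modeling loss: cross-entropy against the ground-truth tokens.
    lm_loss = F.cross_entropy(model_logits.view(-1, vocab), target_ids.view(-1))

    # Knowledge distillation loss: KL divergence from the small model's
    # (temperature-softened) distribution to the larger model's distribution.
    log_p_model = F.log_softmax(model_logits / temperature, dim=-1)
    p_small = F.softmax(small_model_logits / temperature, dim=-1)
    kd_loss = F.kl_div(log_p_model, p_small, reduction="batchmean") * temperature**2

    return lam * kd_loss + (1.0 - lam) * lm_loss

def linear_lambda_schedule(step, total_steps, lam_start=1.0, lam_end=0.0):
    """Gradually decrease lambda so the larger model relies less on the small one."""
    frac = min(step / max(total_steps, 1), 1.0)
    return lam_start + (lam_end - lam_start) * frac
```

At each training step the scheduled value would simply be passed in, e.g. loss = combined_loss(logits, small_logits, targets, lam=linear_lambda_schedule(step, total_steps)).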
Tags
Foundations of Large Language Models
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Combined Training Objective for Knowledge Distillation
In a model training setup, a smaller 'student' model is trained to mimic the output probability distribution of a larger 'teacher' model for a given input. The training objective is to minimize the Kullback-Leibler (KL) divergence between the two distributions. The standard loss function is defined as L_KD = KL(p_teacher || p_student). A researcher proposes an alternative loss function, L'_KD = KL(p_student || p_teacher). How would minimizing L'_KD instead of L_KD most likely change the student model's behavior?
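Assuming the two losses in this question contrast the forward KL, KL(p_teacher || p_student), with the reverse KL, KL(p_student || p_teacher), which is the usual reading of this setup, a small numeric sketch shows how the direction of the divergence changes what the student is penalized for. The toy distributions below are made up for illustration.

```python
import torch

# Toy example: a bimodal teacher and a student that has collapsed onto one mode.
p_teacher = torch.tensor([0.45, 0.05, 0.45, 0.05])
p_student = torch.tensor([0.70, 0.10, 0.10, 0.10])

def kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return torch.sum(p * (p.log() - q.log())).item()

# Forward KL (the standard KD loss) penalizes the student wherever the teacher
# puts probability mass that the student misses -> mode-covering pressure.
forward_kl = kl(p_teacher, p_student)

# Reverse KL (the alternative loss) penalizes the student for putting mass
# where the teacher has little -> mode-seeking pressure.
reverse_kl = kl(p_student, p_teacher)

print(f"KL(teacher || student) = {forward_kl:.4f}")
print(f"KL(student || teacher) = {reverse_kl:.4f}")
```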
Evaluating Student Model Performance
In a knowledge distillation process, a 'teacher' model produces a probability distribution of [0.8, 0.1, 0.1] over three classes for a given input. Four different 'student' models are being evaluated on the same input, producing the distributions below. Which student model's output distribution is being most effectively guided by the teacher, as measured by the standard Kullback-Leibler (KL) divergence loss function?
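The four candidate distributions are not reproduced above, but the comparison itself is a direct calculation: the student whose output yields the smallest KL(teacher || student) is the one being guided most effectively. A minimal sketch, with placeholder student outputs rather than the actual answer options:

```python
import math

teacher = [0.8, 0.1, 0.1]

# Placeholder candidates for illustration only; substitute the four
# student distributions from the question.
students = {
    "Student A": [0.7, 0.2, 0.1],
    "Student B": [0.4, 0.3, 0.3],
}

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), with terms where p_i = 0 dropped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Lower KL(teacher || student) means the student tracks the teacher more closely.
for name, dist in sorted(students.items(), key=lambda kv: kl_divergence(teacher, kv[1])):
    print(f"{name}: KL = {kl_divergence(teacher, dist):.4f}")
```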