
Adjusting the Distillation Loss Coefficient

In the combined training objective for large language models, given a dataset of input-output pairs (denoted $\mathcal{D}$), the interpolation coefficient (denoted $\lambda$) determines how strongly a smaller model influences training. This coefficient can be adjusted during either the pre-training or fine-tuning phase. A common strategy is to gradually decrease $\lambda$, which reduces reliance on the small model's knowledge distillation loss and places more emphasis on the original language modeling loss as the larger model becomes more capable.
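The text does not reproduce the objective itself, but a standard form of such a combined loss is $\mathcal{L}(\theta) = (1 - \lambda)\,\mathcal{L}_{\text{LM}}(\theta; \mathcal{D}) + \lambda\,\mathcal{L}_{\text{KD}}(\theta; \mathcal{D})$, where $\mathcal{L}_{\text{LM}}$ is the language modeling (cross-entropy) loss and $\mathcal{L}_{\text{KD}}$ is the distillation loss against the small model's output distribution. The sketch below illustrates this interpolation and a linear decay schedule for $\lambda$; the function names, the KL-based distillation term, the temperature scaling, and the schedule endpoints are illustrative assumptions, not taken from the source.

```python
import torch.nn.functional as F

def combined_loss(student_logits, teacher_logits, labels, lam, temperature=1.0):
    """Interpolate between the language modeling loss and a distillation loss.

    `lam` is the coefficient called lambda in the text. The KL form and the
    temperature scaling are common choices, assumed here for illustration.
    """
    # Standard next-token cross-entropy against the ground-truth labels.
    lm_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    # KL divergence from the small model's softened output distribution.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    return (1.0 - lam) * lm_loss + lam * kd_loss

def linear_lambda(step, total_steps, lam_start=0.5, lam_end=0.0):
    """Linearly anneal lambda over training, shifting weight from the
    distillation term toward the language modeling term."""
    frac = min(step / max(total_steps, 1), 1.0)
    return lam_start + frac * (lam_end - lam_start)
```

With `lam_start=0.5` and `lam_end=0.0`, training begins by weighting both terms equally and ends on the pure language modeling loss, matching the gradual-decrease strategy described above.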
