Formula

Combined Training Objective Formula for Knowledge Distillation

To incorporate knowledge distillation into language modeling, one can add the distillation loss as an auxiliary term to the original language modeling loss, yielding a combined training objective. This is formulated as a maximization problem to find the optimal student (large model) parameters $\tilde{\theta}$:

$$\tilde{\theta} = \arg\max_{\theta} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \log \mathrm{Pr}_{\theta}^{s}(\mathbf{y} \mid \mathbf{x}) - \lambda \cdot \mathrm{Loss}_{\mathrm{kd}}$$

In this formula, $\mathcal{D}$ is the set of input-output pairs, and $\lambda$ is the interpolation coefficient. The first term, $\log \mathrm{Pr}_{\theta}^{s}(\mathbf{y} \mid \mathbf{x})$, is the log-likelihood of the ground-truth data, encouraging the student to learn from the actual labels. The second term, $\mathrm{Loss}_{\mathrm{kd}}$, is the knowledge distillation loss that pushes the student model to mimic the small teacher model. This method can be employed in either the pre-training or fine-tuning phase.
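As a concrete illustration, below is a minimal PyTorch sketch of this combined objective, written as a loss to be minimized (minimizing the negative log-likelihood plus $\lambda \cdot \mathrm{Loss}_{\mathrm{kd}}$ is equivalent to the maximization above). The choice of $\mathrm{Loss}_{\mathrm{kd}}$ as a token-level KL divergence between the teacher's and student's output distributions is an assumption, and all function and variable names here are illustrative, not from the source.

```python
import torch
import torch.nn.functional as F


def combined_kd_loss(
    student_logits: torch.Tensor,  # [batch, seq_len, vocab]
    teacher_logits: torch.Tensor,  # [batch, seq_len, vocab]
    labels: torch.Tensor,          # [batch, seq_len], ground-truth token ids
    lam: float = 0.5,              # interpolation coefficient (lambda)
) -> torch.Tensor:
    """Combined KD training objective, expressed as a loss to minimize."""
    vocab = student_logits.size(-1)

    # First term: negative log-likelihood of the ground-truth tokens.
    # Minimizing this maximizes log Pr^s_theta(y|x) over the data.
    nll = F.cross_entropy(student_logits.reshape(-1, vocab), labels.reshape(-1))

    # Second term: Loss_kd, assumed here to be the token-level KL divergence
    # from the teacher's distribution to the student's. kl_div expects
    # log-probabilities as input and probabilities as target. The teacher
    # logits should be computed under torch.no_grad() so that only the
    # student parameters theta receive gradients.
    kd = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )

    return nll + lam * kd
```

In a training loop, one would call this with the teacher's logits precomputed (or computed with gradients disabled) and backpropagate the returned loss through the student only; $\lambda$ then controls how strongly the teacher's distribution pulls on the student relative to the ground-truth labels.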


