Combined Training Objective Formula for Knowledge Distillation
To incorporate knowledge distillation into language modeling, the distillation loss can be added to the original language modeling loss as an auxiliary term, yielding a combined training objective. This is formulated as a maximization problem to find the optimal parameters of the student (the smaller model), θ_student:

θ̂_student = argmax_{θ_student} Σ_{(x,y)∈D} [ (1 − λ) · log Pr(y | x; θ_student) − λ · Loss_KD(x; θ_teacher, θ_student) ]

In this formula, D is the set of input–output pairs (x, y), and λ ∈ [0, 1] is the interpolation coefficient balancing the two terms. The first term, log Pr(y | x; θ_student), is the log-likelihood of the ground-truth data, encouraging the model to learn from actual labels. The second term, Loss_KD, is the knowledge distillation loss that pushes the student model to mimic the larger teacher model; it enters with a minus sign because it is a loss inside a maximization. This method can be employed in either the pre-training or fine-tuning phase.
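As a minimal sketch of this objective (the function name and the per-token, NumPy-only setup are illustrative assumptions, not from the source), the equivalent minimization form for a single token is (1 − λ) · NLL on the ground-truth label plus λ · KL(teacher ‖ student):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def combined_kd_loss(student_logits, teacher_logits, target_idx, lam):
    """Per-token combined loss: (1 - lam) * NLL + lam * KL(teacher || student).

    Minimizing this matches maximizing the combined objective in the text
    (up to a constant teacher-entropy term inside the KL).
    Hypothetical helper for illustration; real training would batch this
    over sequences in an autodiff framework.
    """
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    nll = -np.log(p_student[target_idx])                       # ground-truth term
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))  # KD term
    return (1.0 - lam) * nll + lam * kl
```

Setting λ = 0 recovers pure language modeling on the labels, λ = 1 recovers pure distillation toward the teacher, and intermediate values interpolate linearly between the two.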

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Dynamic Adjustment of the Knowledge Distillation Coefficient (λ)
Optimizing Student Model Training
When training a smaller 'student' model using a combined objective that learns from both a larger 'teacher' model and the ground-truth data, what is the primary role of the component that learns directly from the ground-truth data?
A student model is being trained using a combined objective that incorporates learning from both a larger 'teacher' model and the ground-truth data. Match each learning source with its primary contribution to the student model's training process.
Learn After
An engineer is training a small 'student' model by learning from a larger 'teacher' model. The training objective is to find the student parameters (θ) that maximize a combined score, formulated as (1 − λ) · Term A + λ · Term B, where 'Term A' measures how well the student predicts the correct, ground-truth answers, and 'Term B' measures how closely the student's outputs match the teacher's outputs. After training, the engineer notices that the student model is replicating systematic errors present in the teacher model, leading to poor performance on a validation set. Which adjustment to the hyperparameter λ is the most appropriate first step to address this issue?
Analyzing the Knowledge Distillation Hyperparameter
A machine learning team is using a combined objective to train a small 'student' model. The goal is to find the student model's parameters (θ) that maximize the following expression: Σ_{(x,y)∈D} [ (1 − λ) · log Pr(y | x; θ) − λ · Loss_KD ]. The first term, log Pr(y | x; θ), measures how well the student predicts the ground-truth labels y. The second term, Loss_KD, measures the difference between the student's and a larger 'teacher' model's predictions. The team is working with a dataset where the ground-truth labels are known to be somewhat noisy and to contain occasional errors; however, the large teacher model has been shown to provide very reliable, well-generalized predictions. Given this situation, how should the team adjust the hyperparameter λ to optimize the student model's performance?