KL Divergence Loss for Knowledge Distillation
In knowledge distillation, an alternative approach is to minimize the distance between the output probability distributions of the teacher and student models. A common loss function for this is the Kullback-Leibler (KL) divergence. For instance, in context distillation, the loss is defined as:

$$\mathcal{L}(\theta) = \mathrm{KL}\big(P_{\text{teacher}}(\cdot \mid c, x) \,\big\|\, P_{\text{student}}(\cdot \mid c', x; \theta)\big)$$

where $P_{\text{teacher}}(\cdot \mid c, x)$ is the teacher model's probability distribution given the full context $c$ and user input $x$, and $P_{\text{student}}(\cdot \mid c', x; \theta)$ is the student model's distribution given the simplified context $c'$ and user input $x$, with parameters $\theta$.
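To make the formula concrete, here is a minimal PyTorch sketch of this loss. It is an illustrative implementation under assumed names and shapes (a `(batch, vocab)` logits tensor from each model), not a reference implementation; the teacher's logits are detached so that only the student's parameters receive gradients.

```python
import torch
import torch.nn.functional as F

def context_distillation_loss(teacher_logits: torch.Tensor,
                              student_logits: torch.Tensor) -> torch.Tensor:
    """KL(P_teacher || P_student), averaged over the batch.

    teacher_logits: teacher outputs on the full context c (shape: batch x vocab)
    student_logits: student outputs on the simplified context c' (shape: batch x vocab)
    """
    teacher_probs = F.softmax(teacher_logits, dim=-1)          # target distribution P_teacher
    student_log_probs = F.log_softmax(student_logits, dim=-1)  # log P_student
    # F.kl_div(input, target) computes KL(target || input) when `input` holds
    # log-probabilities and `target` holds probabilities.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Toy usage with made-up logits; only the student's tensor carries gradients.
teacher_logits = torch.tensor([[2.0, 1.0, 0.1]])
student_logits = torch.tensor([[1.5, 1.2, 0.3]], requires_grad=True)
loss = context_distillation_loss(teacher_logits.detach(), student_logits)
loss.backward()
print(float(loss))
```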

Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
KL Divergence Loss for Knowledge Distillation
A compact computational model is being trained to replicate the probabilistic outputs of a large, established reference model. The training process aims to minimize the dissimilarity between the two models' full output distributions for any given input. Below is the output probability distribution from the reference model, along with three candidate outputs from the compact model for the same input.
Reference Model Output: [0.70, 0.20, 0.10]
Which of the compact model outputs below demonstrates the most successful replication of the reference model's output distribution, considering the goal is to match the entire distribution, not just the most likely outcome?
Compact Model - Output A: [0.65, 0.22, 0.13]
Compact Model - Output B: [0.70, 0.10, 0.20]
Compact Model - Output C: [0.50, 0.30, 0.20]
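As a worked check on the question above, the divergence from the reference distribution to each candidate can be computed directly with the discrete KL formula. This is a small plain-Python sketch (values copied from the question); under KL(reference || candidate), Output A yields the smallest divergence, matching the intuition that it tracks the entire distribution rather than only the top probability.

```python
import math

reference = [0.70, 0.20, 0.10]
candidates = {
    "A": [0.65, 0.22, 0.13],
    "B": [0.70, 0.10, 0.20],
    "C": [0.50, 0.30, 0.20],
}

def kl_divergence(p, q):
    """Discrete KL(p || q) in nats; assumes strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

for name, q in candidates.items():
    # A ≈ 0.0066, B ≈ 0.0693, C ≈ 0.0851
    print(f"KL(reference || {name}) = {kl_divergence(reference, q):.4f}")
```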
Rationale for Distribution Matching in Model Training
Knowledge Distillation Loss using KL Divergence
Analyzing Model Training Scenarios
KL Divergence Loss for Knowledge Distillation
Cross-Entropy Loss for Knowledge Distillation
A large, complex language model is used to generate target probabilities for training a smaller, more efficient model. For the input sentence 'The cat sat on the ___', the large model could produce different probability distributions for the next word. Which of the following distributions, representing the teacher's output P_teacher, would provide the most informative and nuanced training signal for the smaller model?
Value of the Teacher's Probability Distribution
In a knowledge distillation process for a machine translation task, a large 'teacher' model translates the sentence 'Je suis content' from French to English. Instead of just outputting 'I am happy', the teacher model produces a full probability distribution over the entire English vocabulary for the next words. Which statement best analyzes the significance of this probability distribution (P_teacher) for training the smaller 'student' model?
Learn After
A machine learning engineer is training a small 'student' model to mimic a large 'teacher' model. The training process aims to minimize the Kullback-Leibler (KL) divergence between the teacher's output probability distribution (P_teacher) and the student's (P_student), formulated as:
Loss = KL(P_teacher || P_student)
Based on the properties of this specific formulation, what is the primary effect of minimizing this loss on the student model's behavior? (A numeric sketch of this property appears at the end of the section.)
Interpreting KL Divergence Loss in Knowledge Distillation
Evaluating Student Model Performance in Knowledge Distillation
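The 'Learn After' question above turns on the direction of the divergence. A minimal numeric sketch (the three-token distributions are invented for illustration) shows the key property of minimizing KL(P_teacher || P_student): the student is heavily penalized for assigning near-zero probability to any outcome the teacher considers plausible, so it is pushed to cover the teacher's full distribution rather than just its mode.

```python
import math

def kl(p, q):
    # Discrete KL(p || q); terms with p_i = 0 contribute nothing by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [0.50, 0.40, 0.10]
covers_all = [0.45, 0.35, 0.20]  # keeps mass on every teacher mode
drops_mode = [0.55, 0.44, 0.01]  # nearly ignores the teacher's rare token

print(kl(teacher, covers_all))  # ≈ 0.037: modest penalty
print(kl(teacher, drops_mode))  # ≈ 0.144: the near-zero entry dominates
```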