Knowledge Distillation Loss using KL Divergence
Instead of just using small models to generate synthetic data, one can incorporate a knowledge distillation loss based on these models. The knowledge distillation loss, denoted as $\mathcal{L}_{\text{KD}}$, quantifies the difference between the output probability distributions of a teacher (here, the small) model and a student (here, the large) model. It is formally defined using the Kullback-Leibler (KL) divergence as:

$$\mathcal{L}_{\text{KD}}(\theta) = \mathrm{KL}\big(p_t(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x)\big) = \sum_{y} p_t(y \mid x) \log \frac{p_t(y \mid x)}{p_\theta(y \mid x)}$$

Here, $p_t(\cdot \mid x)$ is the probability distribution produced by the teacher (or weak) model, and $p_\theta(\cdot \mid x)$ is the distribution from the student model with parameters $\theta$, given an input $x$. This loss measures the difference between the small and large models, and minimizing it encourages the large model to mimic the small model's behavior.
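Under this definition, the loss for a single input can be computed directly from the two output distributions. A minimal sketch in plain Python (the function name and the `eps` guard are illustrative, not from the text):

```python
import math

def kl_divergence(p_teacher, p_student, eps=1e-12):
    """Forward KL divergence KL(p_teacher || p_student) for discrete distributions.

    `eps` guards against log(0) when the student assigns zero probability.
    Terms where the teacher probability is zero contribute nothing.
    """
    return sum(
        pt * math.log(pt / max(ps, eps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )

teacher = [0.7, 0.2, 0.1]
# A student that exactly matches the teacher has zero loss.
print(kl_divergence(teacher, teacher))           # 0.0
# Any mismatch yields a strictly positive loss.
print(kl_divergence(teacher, [0.5, 0.3, 0.2]))   # > 0
```

Note the asymmetry: the expectation is taken under the teacher's distribution, so the student is penalized most heavily for assigning low probability to outcomes the teacher considers likely.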

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.3 Prompting - Foundations of Large Language Models
Related
Knowledge Distillation Loss using KL Divergence
KL Divergence Loss for Knowledge Distillation
A compact computational model is being trained to replicate the probabilistic outputs of a large, established reference model. The training process aims to minimize the dissimilarity between the two models' full output distributions for any given input. Below are the output probability distributions from the reference model and three potential outputs from the compact model for the same input.
Reference Model Output:
[0.70, 0.20, 0.10]
Which of the compact model outputs below demonstrates the most successful replication of the reference model's output distribution, considering the goal is to match the entire distribution, not just the most likely outcome?
Compact Model - Output A:
[0.65, 0.22, 0.13]
Compact Model - Output B:
[0.70, 0.10, 0.20]
Compact Model - Output C:
[0.50, 0.30, 0.20]
Rationale for Distribution Matching in Model Training
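One way to settle the question above is to compute the KL divergence from the reference distribution to each candidate; the output with the smallest divergence matches the full distribution best. A quick illustrative check (not part of the original card):

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

reference = [0.70, 0.20, 0.10]
candidates = {
    "A": [0.65, 0.22, 0.13],
    "B": [0.70, 0.10, 0.20],
    "C": [0.50, 0.30, 0.20],
}

# Score each candidate; lower divergence means a closer overall match.
scores = {name: kl(reference, q) for name, q in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # "A"
```

Note that Output B matches the most likely outcome exactly (0.70) yet scores worse than Output A, because it swaps the probabilities of the other two classes; matching the whole distribution matters.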
Knowledge Distillation Loss using KL Divergence
Analyzing Model Training Scenarios
An engineering team is developing a compact, fast model to replicate the predictions of a much larger, more complex model for a 5-category classification task. They use a specific mathematical function to calculate a 'dissimilarity score' between the probability distributions produced by the two models for each input. A lower score indicates the outputs are more similar. After several training epochs, they observe the average dissimilarity score on a validation dataset has significantly decreased. What is the most accurate interpretation of this observation?
A small, efficient model is being trained to emulate the behavior of a large, powerful model on a 3-category classification task. A mathematical function is used to calculate a 'dissimilarity score' between the probability distributions produced by the two models for a given input, where a higher score indicates a greater difference. For which of the following scenarios would this dissimilarity score be the highest?
Knowledge Distillation Loss using KL Divergence
Evaluating Model Mimicry Performance
Learn After
Combined Training Objective for Knowledge Distillation
In a model training setup, a smaller 'student' model is trained to mimic the output probability distribution of a larger 'teacher' model for a given input. The training objective is to minimize the Kullback-Leibler (KL) divergence between the two distributions. The standard loss function is defined as $\mathcal{L}_{\text{KD}} = \mathrm{KL}(p_t \,\|\, p_\theta)$. A researcher proposes an alternative loss function, $\mathcal{L}_{\text{alt}} = \mathrm{KL}(p_\theta \,\|\, p_t)$, which reverses the order of the arguments. How would minimizing $\mathcal{L}_{\text{alt}}$ instead of $\mathcal{L}_{\text{KD}}$ most likely change the student model's behavior?
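The KL divergence is asymmetric, so reversing its arguments changes the training signal. A common contrast (assumed here as the variant the question intends) is forward KL, $\mathrm{KL}(p_t \,\|\, p_\theta)$, versus reverse KL, $\mathrm{KL}(p_\theta \,\|\, p_t)$: the forward direction punishes the student for missing any mode the teacher supports ("mean-seeking"), while the reverse direction tolerates the student concentrating on a single teacher mode ("mode-seeking"). A small numerical illustration with a hypothetical bimodal teacher:

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [0.49, 0.02, 0.49]   # two strong modes
student = [0.90, 0.05, 0.05]   # collapses onto one teacher mode

forward = kl(teacher, student)  # large: the second mode is nearly ignored
reverse = kl(student, teacher)  # smaller: the student stays inside a teacher mode
print(forward, reverse)
```

Here the forward KL is noticeably larger than the reverse KL for the same pair of distributions, which is why a mode-collapsed student is penalized under the standard distillation loss but looks comparatively acceptable under the reversed one.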
Evaluating Student Model Performance
In a knowledge distillation process, a 'teacher' model produces a probability distribution of
[0.8, 0.1, 0.1]
over three classes for a given input. Four different 'student' models are being evaluated on the same input, producing the distributions below. Which student model's output distribution is being most effectively guided by the teacher, as measured by the standard Kullback-Leibler (KL) divergence loss function?
Adjusting the Distillation Loss Coefficient