Formula

Knowledge Distillation Loss using KL Divergence

Instead of just using small models to generate synthetic data, one can incorporate a knowledge distillation loss based on these models. The knowledge distillation loss, denoted as $\mathrm{Loss}_{\mathrm{kd}}$, quantifies the difference between the output probability distributions of a teacher (here, the small or weak) model and a student (the large or strong) model. It is formally defined using the Kullback-Leibler (KL) divergence as:

$$\mathrm{Loss}_{\mathrm{kd}} = \mathrm{KL}\left(\mathrm{Pr}^{w}(\cdot \mid \mathbf{x}) \,\|\, \mathrm{Pr}^{s}_{\theta}(\cdot \mid \mathbf{x})\right)$$

Here, $\mathrm{Pr}^{w}(\cdot \mid \mathbf{x})$ is the probability distribution produced by the teacher (or weak) model, and $\mathrm{Pr}^{s}_{\theta}(\cdot \mid \mathbf{x})$ is the distribution produced by the student model with parameters $\theta$, given an input $\mathbf{x}$. Minimizing this loss encourages the large (student) model to match the output distribution, and hence the behavior, of the small (teacher) model.
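As a concrete illustration, below is a minimal sketch of this loss in PyTorch. The choice of PyTorch, the function name `kd_loss`, and the optional `temperature` argument are assumptions for illustration only and are not part of the definition above; setting the temperature to 1.0 recovers the formula exactly. The sketch computes $\mathrm{KL}(\mathrm{Pr}^{w}(\cdot \mid \mathbf{x}) \,\|\, \mathrm{Pr}^{s}_{\theta}(\cdot \mid \mathbf{x}))$ from the raw next-token logits of the two models, with gradients flowing only through the student.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            temperature: float = 1.0) -> torch.Tensor:
    """Knowledge distillation loss KL(Pr^w || Pr^s_theta), batch-averaged.

    student_logits, teacher_logits: [batch, vocab_size] raw scores for the
    next token given the same input x. `temperature` is a common extension,
    not part of the formula above; temperature = 1.0 matches it exactly.
    """
    # Teacher (small/weak) distribution Pr^w(.|x); no gradient flows here.
    with torch.no_grad():
        teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Student (large/strong) distribution Pr^s_theta(.|x), in log space.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # F.kl_div(log q, p) returns sum p * (log p - log q) = KL(p || q),
    # i.e. the teacher distribution is the reference, as in the formula.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Toy usage: a batch of 4 positions over a 32k-token vocabulary.
student_logits = torch.randn(4, 32000, requires_grad=True)
teacher_logits = torch.randn(4, 32000)
loss = kd_loss(student_logits, teacher_logits)
loss.backward()  # gradients update only the student parameters theta
```

In practice, a term like this is often combined with the standard language-modeling (cross-entropy) loss as a weighted sum; only the distillation term is shown here.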
