
KL Divergence Loss for Knowledge Distillation

In knowledge distillation, an alternative approach is to minimize the distance between the output probability distributions of the teacher and student models. A common loss function for this is the Kullback-Leibler (KL) divergence. For instance, in context distillation, the loss is defined as:

$$\mathrm{Loss} = \mathrm{KL}\left(\mathrm{P}^t \,\|\, \mathrm{P}^s_{\theta}\right)$$

where $\mathrm{P}^t = \mathrm{Pr}^{t}(\cdot \mid \mathbf{c}, \mathbf{z})$ is the teacher model's probability distribution given the full context $\mathbf{c}$ and user input $\mathbf{z}$, and $\mathrm{P}^s_{\theta} = \mathrm{Pr}_{\theta}^{s}(\cdot \mid \mathbf{c}', \mathbf{z})$ is the student model's distribution given the simplified context $\mathbf{c}'$ and user input $\mathbf{z}$, with parameters $\theta$.
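As a minimal sketch of this loss, the snippet below computes $\mathrm{KL}(\mathrm{P}^t \,\|\, \mathrm{P}^s_{\theta})$ over a tiny next-token distribution. The logits and the three-word vocabulary are hypothetical stand-ins for the teacher's output given the full context $\mathbf{c}$ and the student's output given the simplified context $\mathbf{c}'$; a real implementation would use the models' actual logits over the full vocabulary.

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p_teacher, p_student, eps=1e-12):
    """KL(P^t || P^s) = sum_y P^t(y) * log(P^t(y) / P^s(y)).

    eps guards against log(0) for near-zero probabilities.
    """
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(p_teacher, p_student))

# Hypothetical next-token logits over a 3-word vocabulary:
# teacher conditioned on the full context c, student on the simplified c'.
teacher_probs = softmax([2.0, 1.0, 0.1])
student_probs = softmax([1.5, 1.2, 0.3])

loss = kl_divergence(teacher_probs, student_probs)
```

During training, this scalar would be minimized with respect to the student's parameters $\theta$; the KL divergence is zero exactly when the two distributions match, and non-negative otherwise.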


Updated 2026-05-02

