
Cross-Entropy Loss for Knowledge Distillation

A frequently used loss function in knowledge distillation is the sequence-level loss, which often takes the form of cross-entropy. This loss measures the dissimilarity between the teacher model's output distribution, $\text{Pr}^t(\mathbf{y}|\mathbf{c}, \mathbf{z})$, and the student model's distribution, $\text{Pr}_{\theta}^s(\mathbf{y}|\mathbf{c}', \mathbf{z})$. The total loss is the negative sum, over all possible output sequences $\mathbf{y}$, of the log probability assigned by the student, weighted by the teacher's probability for that sequence:

$$\text{Loss} = -\sum_{\mathbf{y}} \text{Pr}^t(\mathbf{y}|\mathbf{c}, \mathbf{z}) \log \text{Pr}_{\theta}^s(\mathbf{y}|\mathbf{c}', \mathbf{z})$$
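As a concrete illustration, here is a minimal Python sketch of this loss. It assumes the intractable sum over all output sequences is approximated by a small candidate set, and the probabilities are illustrative placeholders rather than outputs of real teacher and student models; the function name kd_cross_entropy is hypothetical.

```python
import math

def kd_cross_entropy(teacher_probs, student_probs):
    # Loss = -sum_y Pr^t(y|c,z) * log Pr^s_theta(y|c',z),
    # with the sum restricted to the candidate sequences provided.
    return -sum(p_t * math.log(p_s)
                for p_t, p_s in zip(teacher_probs, student_probs))

# Teacher and student probabilities for three candidate output sequences y
# (placeholder numbers; real models would produce these distributions).
teacher = [0.7, 0.2, 0.1]   # Pr^t(y | c, z)
student = [0.5, 0.3, 0.2]   # Pr^s_theta(y | c', z)

print(round(kd_cross_entropy(teacher, student), 3))  # 0.887
```

The loss is smallest when the student assigns high probability to the sequences the teacher favors; in practice the sum is estimated from a sample of teacher outputs rather than enumerated exactly.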
