Formula

General Loss Function for Knowledge Distillation

The general loss function for knowledge distillation, often written simply as $\text{Loss}$, measures the discrepancy between a teacher and a student model for a given input $\mathbf{x}$. It is formally expressed as $\text{Loss}(\text{Pr}^t(\cdot|\cdot), \text{Pr}_{\theta}^s(\cdot|\cdot), \mathbf{x})$, where $\text{Pr}^t(\cdot|\cdot)$ is the probability distribution of the pre-trained teacher model and $\text{Pr}_{\theta}^s(\cdot|\cdot)$ is the distribution of the student model with parameters $\theta$. The training objective is to minimize this loss, thereby teaching the student to replicate the teacher's behavior.
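The formula above leaves the discrepancy measure abstract. A minimal sketch, assuming the common instantiation of $\text{Loss}$ as the KL divergence between temperature-softened teacher and student distributions; the helper name `kd_loss` and the temperature scheme are illustrative choices, not prescribed by the formula:

```python
import torch
import torch.nn.functional as F

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """One possible instantiation of Loss(Pr^t, Pr^s_theta, x)
    as KL(Pr^t || Pr^s_theta) over temperature-softened logits.

    Both inputs are raw logits of shape (batch, vocab_size).
    Temperature softening is a common convention, not part of
    the general formula.
    """
    # Teacher distribution Pr^t(.|x), softened by the temperature
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Student log-distribution log Pr^s_theta(.|x)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Batch-averaged KL divergence; the T^2 factor keeps gradient
    # magnitudes comparable across temperature settings
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Usage: only the student carries gradients, so minimizing the loss
# updates theta while the pre-trained teacher stays fixed.
teacher_logits = torch.randn(4, 32000)                       # frozen teacher
student_logits = torch.randn(4, 32000, requires_grad=True)   # trainable student
loss = kd_loss(teacher_logits, student_logits)
loss.backward()
```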
