General Loss Function for Knowledge Distillation
The general loss function for knowledge distillation, often written as L(θ) for simplicity, measures the discrepancy between a teacher and a student model for a given input x. The function is formally expressed as L(θ) = Loss(P_t(x), P_θ^s(x)), where P_t(x) is the probability distribution of the pre-trained teacher model, and P_θ^s(x) is the distribution of the student model with parameters θ. The training objective is to minimize this loss, thereby teaching the student to replicate the teacher's behavior.
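The idea can be sketched in code. This is a minimal illustration, assuming the discrepancy Loss(·,·) is instantiated as the KL divergence KL(P_t || P_θ^s), a common choice; the function name kd_loss and the example distributions are hypothetical.

```python
import math

def kd_loss(teacher_probs, student_probs, eps=1e-12):
    """KL(P_t || P_theta^s): discrepancy between teacher and student
    output distributions over the same vocabulary for one input x.
    eps guards against log(0) for zero-probability entries."""
    return sum(
        pt * math.log((pt + eps) / (ps + eps))
        for pt, ps in zip(teacher_probs, student_probs)
    )

# Teacher's distribution over a 3-token vocabulary for some input x.
teacher = [0.7, 0.2, 0.1]

# A student that matches the teacher closely incurs a small loss;
# a student that disagrees incurs a large one. Minimizing this loss
# over the student's parameters drives its distribution toward the teacher's.
student_close = [0.65, 0.25, 0.10]
student_far = [0.10, 0.20, 0.70]
assert kd_loss(teacher, student_close) < kd_loss(teacher, student_far)
```

Note that the loss is zero exactly when the student's distribution equals the teacher's, which is why a persistently high loss indicates the student has not yet learned to mimic the teacher.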

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.3 Prompting - Foundations of Large Language Models
Related
Distillation Loss for Response-Based Knowledge
Objective Function for Student Model Training via Knowledge Distillation
Definition of Teacher's Probability Distribution (Pt) in Knowledge Distillation
Definition of Student's Probability Distribution (P_theta^s)
Optimizing a Language Model for Mobile Deployment
A research lab has developed a very large and complex language model that achieves state-of-the-art performance on a translation task. However, due to its size, the model is too slow and expensive to deploy for a real-time translation mobile app. To address this, the team uses the large model's predictions on a set of sentences to train a new, much smaller and faster model. What is the primary strategic advantage of this two-model approach?
A development team is using a knowledge distillation framework to create a compact, efficient language model (the 'student') from a much larger, high-performance model (the 'teacher'). The goal is to deploy the student model on devices with limited computational resources. Which statement best analyzes the typical relationship between the inputs processed by the teacher and student models during this process?
Learn After
Deconstructing the Knowledge Transfer Loss Function
An engineer is training a compact 'student' model to replicate the behavior of a larger 'teacher' model. The training process aims to minimize a loss function that measures the difference between the output probability distributions of the two models for any given input. If the loss value remains high throughout the training, what is the most direct conclusion?
Analyzing the Components of a Model Mimicry Loss Function