Learn Before
Rationale for Using Logits in Distillation Loss
In the context of transferring knowledge from a large 'teacher' model to a smaller 'student' model, why is it often more effective to calculate the divergence loss directly on the raw, pre-activation output vectors (logits) of the two models, rather than on the final probability distributions produced after applying a standard activation function (like softmax)?
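To illustrate the two options the question contrasts, here is a minimal, hypothetical PyTorch-style sketch (the tensors `teacher_logits` and `student_logits` are made up for the example). It places a divergence computed on post-softmax probabilities next to a loss computed directly on the raw logits, and shows one way the former can miss differences the latter still sees: softmax is invariant to adding a constant to every logit.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for one input with 3 classes. The teacher's logits are
# the student's logits shifted by a constant, so their softmax outputs are
# identical even though the raw vectors differ substantially.
student_logits = torch.tensor([[5.0, 2.0, 1.0]], requires_grad=True)
teacher_logits = torch.tensor([[15.0, 12.0, 11.0]])

# Loss on probability distributions: KL divergence after softmax.
# kl_div expects log-probabilities as the first argument and probabilities as the second.
prob_loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.softmax(teacher_logits, dim=-1),
    reduction="batchmean",
)

# Loss directly on the raw, pre-activation outputs: mean squared error on logits.
logit_loss = F.mse_loss(student_logits, teacher_logits)

print(prob_loss.item())   # ~0.0: the softmax distributions are indistinguishable
print(logit_loss.item())  # 100.0: the logit vectors themselves are far apart
```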
Tags
Deep Learning (in Machine learning)
Data Science
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An engineer is training a small 'student' model to mimic the predictions of a larger, pre-trained 'teacher' model. The training objective is to make the student's final, pre-activation output vector as similar as possible to the teacher's. If z_t is the teacher's output vector and z_s is the student's output vector for the same input, which of the following loss functions correctly implements this objective?
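Purely as a hedged sketch (not the card's answer options), one common way to write a logit-matching objective of this kind is a squared-error loss between the two pre-activation vectors; the tensor names below mirror z_t and z_s from the question.

```python
import torch
import torch.nn.functional as F

def logit_matching_loss(z_s: torch.Tensor, z_t: torch.Tensor) -> torch.Tensor:
    """Squared-error loss pushing the student's logits z_s toward the teacher's logits z_t."""
    # The teacher is frozen during distillation, so its logits carry no gradient.
    return F.mse_loss(z_s, z_t.detach())

# Example: a batch of 2 inputs over 4 classes (values are made up).
z_t = torch.randn(2, 4)                      # teacher logits
z_s = torch.randn(2, 4, requires_grad=True)  # student logits
loss = logit_matching_loss(z_s, z_t)
loss.backward()                              # gradients flow only into the student
```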
When implementing response-based knowledge distillation, the loss function is calculated by first applying a softmax function to the teacher and student model outputs to convert them into probability distributions, and then measuring the divergence between these two distributions.
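A minimal sketch of the computation this statement describes, assuming a PyTorch setup: both models' outputs are passed through a softmax (a temperature is often added in practice, though the statement above does not mention one) and the divergence between the resulting distributions is measured with KL divergence.

```python
import torch
import torch.nn.functional as F

def response_based_kd_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """KL divergence between the softmax outputs of the teacher and student.

    With temperature > 1 the distributions are softened; scaling the result by
    temperature**2 keeps gradient magnitudes comparable across temperatures
    (a common convention).
    """
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * (temperature ** 2)

# Example usage with made-up logits for a batch of 3 inputs and 5 classes.
teacher_logits = torch.randn(3, 5)
student_logits = torch.randn(3, 5, requires_grad=True)
loss = response_based_kd_loss(student_logits, teacher_logits, temperature=2.0)
loss.backward()
```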