Learn Before
Rationale for Using Logits in Distillation Loss
In the context of transferring knowledge from a large 'teacher' model to a smaller 'student' model, why is it often more effective to calculate the divergence loss directly on the raw, pre-activation output vectors (logits) of the two models, rather than on the final probability distributions produced after applying a standard activation function (like softmax)?
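To illustrate the two options the question contrasts, here is a minimal, hypothetical PyTorch-style sketch (the tensors `teacher_logits` and `student_logits` are made up for the example). It places a divergence computed on post-softmax probabilities next to a loss computed directly on the raw logits, and shows one way the former can miss differences the latter still sees: softmax is invariant to adding a constant to every logit.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for one input with 3 classes. The teacher's logits are
# the student's logits shifted by a constant, so their softmax outputs are
# identical even though the raw vectors differ substantially.
student_logits = torch.tensor([[5.0, 2.0, 1.0]], requires_grad=True)
teacher_logits = torch.tensor([[15.0, 12.0, 11.0]])

# Loss on probability distributions: KL divergence after softmax.
# kl_div expects log-probabilities as the first argument and probabilities as the second.
prob_loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.softmax(teacher_logits, dim=-1),
    reduction="batchmean",
)

# Loss directly on the raw, pre-activation outputs: mean squared error on logits.
logit_loss = F.mse_loss(student_logits, teacher_logits)

print(prob_loss.item())   # ~0.0: the softmax distributions are indistinguishable
print(logit_loss.item())  # 100.0: the logit vectors themselves are far apart
```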
Tags
Deep Learning (in Machine learning)
Data Science
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An engineer is training a small 'student' model to mimic the predictions of a larger, pre-trained 'teacher' model. The training objective is to make the student's final, pre-activation output vector as similar as possible to the teacher's. If z_t is the teacher's output vector and z_s is the student's output vector for the same input, which of the following loss functions correctly implements this objective?
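Purely as a hedged sketch (not the card's answer options), one common way to write a logit-matching objective of this kind is a squared-error loss between the two pre-activation vectors; the tensor names below mirror z_t and z_s from the question.

```python
import torch
import torch.nn.functional as F

def logit_matching_loss(z_s: torch.Tensor, z_t: torch.Tensor) -> torch.Tensor:
    """Squared-error loss pushing the student's logits z_s toward the teacher's logits z_t."""
    # The teacher is frozen during distillation, so its logits carry no gradient.
    return F.mse_loss(z_s, z_t.detach())

# Example: a batch of 2 inputs over 4 classes (values are made up).
z_t = torch.randn(2, 4)                      # teacher logits
z_s = torch.randn(2, 4, requires_grad=True)  # student logits
loss = logit_matching_loss(z_s, z_t)
loss.backward()                              # gradients flow only into the student
```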
When implementing response-based knowledge distillation, the loss function is calculated by first applying a softmax function to the teacher and student model outputs to convert them into probability distributions, and then measuring the divergence between these two distributions.
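A minimal sketch of the computation this statement describes, assuming a PyTorch setup: both models' outputs are passed through a softmax (a temperature is often added in practice, though the statement above does not mention one) and the divergence between the resulting distributions is measured with KL divergence.

```python
import torch
import torch.nn.functional as F

def response_based_kd_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """KL divergence between the softmax outputs of the teacher and student.

    With temperature > 1 the distributions are softened; scaling the result by
    temperature**2 keeps gradient magnitudes comparable across temperatures
    (a common convention).
    """
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * (temperature ** 2)

# Example usage with made-up logits for a batch of 3 inputs and 5 classes.
teacher_logits = torch.randn(3, 5)
student_logits = torch.randn(3, 5, requires_grad=True)
loss = response_based_kd_loss(student_logits, teacher_logits, temperature=2.0)
loss.backward()
```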