Learn Before
When implementing response-based knowledge distillation, the loss function is calculated by first applying a softmax function (often with a temperature to soften the distributions) to the teacher and student model outputs to convert them into probability distributions, and then measuring the divergence between these two distributions.
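The statement above can be sketched in code. This is a minimal NumPy illustration of the two steps it describes (temperature-softened softmax, then KL divergence); the function names and the choice of temperature are illustrative assumptions, and the T² gradient-correction factor sometimes applied in practice is omitted:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; T > 1 softens the distribution.
    e = np.exp((z - z.max()) / T)
    return e / e.sum()

def distillation_loss(z_teacher, z_student, T=2.0):
    # KL(p_teacher || p_student) between the softened distributions.
    p = softmax(z_teacher, T)
    q = softmax(z_student, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Identical logits produce zero divergence; differing logits a positive loss.
z = np.array([2.0, 1.0, 0.1])
print(distillation_loss(z, z))
print(distillation_loss(z, np.array([0.5, 1.0, 2.0])) > 0.0)
```

Minimizing this KL term pushes the student's softened output distribution toward the teacher's.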
Tags
Deep Learning (in Machine learning)
Data Science
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An engineer is training a small 'student' model to mimic the predictions of a larger, pre-trained 'teacher' model. The training objective is to make the student's final, pre-activation output vector as similar as possible to the teacher's. If z_t is the teacher's output vector and z_s is the student's output vector for the same input, which of the following loss functions correctly implements this objective?
Rationale for Using Logits in Distillation Loss
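The related question asks for a loss that matches the pre-activation outputs (logits) directly, rather than the softmax probabilities. A minimal NumPy sketch of that logit-matching objective, assuming a single input and mean squared error as the similarity measure (the function name is illustrative):

```python
import numpy as np

def logit_matching_loss(z_t, z_s):
    # Mean squared error computed directly on the pre-activation logits,
    # with no softmax applied to either output vector.
    return float(np.mean((z_t - z_s) ** 2))

z_t = np.array([3.0, 1.0, -2.0])  # teacher logits
z_s = np.array([2.5, 1.5, -2.0])  # student logits
print(logit_matching_loss(z_t, z_s))
```

Unlike the softmax-then-divergence formulation, this loss penalizes differences in the raw logits, so it is minimized only when z_s equals z_t exactly.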