Learn Before
Computational Infeasibility of Full Output Summation in Distillation Loss
The direct application of the cross-entropy loss function for knowledge distillation is often computationally impractical. This is because the formula requires a summation over the entire set of possible output sequences, and the number of such sequences grows exponentially with output length, making exact computation infeasible in most real-world scenarios.
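To make the blow-up concrete, the sketch below enumerates every possible output sequence for a toy vocabulary. The vocabulary, sequence length, and realistic figures (vocabulary ~32,000, length ~100) are illustrative assumptions, not values from the card:

```python
import itertools
import math

# Toy setup (illustrative assumptions): a 3-token vocabulary and length-4 outputs.
vocab = ["a", "b", "c"]   # |V| = 3
seq_len = 4               # output sequences of length 4

# Enumerating every possible output sequence: there are |V| ** seq_len of them.
all_outputs = list(itertools.product(vocab, repeat=seq_len))
print(len(all_outputs))   # 3**4 = 81 -- exhaustive enumeration is still feasible here

# For a realistic model (say |V| = 32,000 and length 100), the count is
# 32,000 ** 100; its number of decimal digits alone shows the sum is infeasible.
print(round(math.log10(32_000) * 100), "decimal digits, roughly")
```

Even this tiny example makes the point: the full summation in the distillation loss requires one term per sequence, so its cost scales with |V|^T, not with |V| or T.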
Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Computational Infeasibility of Full Output Summation in Distillation Loss
A student model is trained to mimic a teacher model by minimizing the following loss function, which measures the dissimilarity between their output probability distributions for a given input:

Loss = -Σ_y Pr^t(y) log Pr^s(y)

In this formula, Pr^t(y) is the teacher's probability for an output sequence y, Pr^s(y) is the student's probability, and the summation is over all possible output sequences y. What is the primary function of the summation (Σ_y) over the entire space of possible outputs?
Evaluating a Loss Function for a Machine Translation Task
A student model is being trained to replicate the output distribution of a teacher model using the loss function:

Loss = -Σ_y Pr^t(y) log Pr^s(y)
Suppose for a given input, there are only three possible output sequences: A, B, and C. The teacher model assigns the following probabilities:
Pr^t(A) = 0.8, Pr^t(B) = 0.15, Pr^t(C) = 0.05
Two different student models produce the following distributions:
- Student 1: Pr^s(A) = 0.6, Pr^s(B) = 0.3, Pr^s(C) = 0.1
- Student 2: Pr^s(A) = 0.6, Pr^s(B) = 0.1, Pr^s(C) = 0.3
Without calculating the exact loss, which student model will achieve a lower loss value, and why?
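The intuition can be checked numerically. The sketch below computes the cross-entropy -Σ_y Pr^t(y) log Pr^s(y) for both students, assuming that standard form of the loss; the dictionary names are my own, and the probabilities come from the card above:

```python
import math

# Teacher and student distributions over the three possible outputs (from the card).
teacher = {"A": 0.8, "B": 0.15, "C": 0.05}
student1 = {"A": 0.6, "B": 0.3, "C": 0.1}
student2 = {"A": 0.6, "B": 0.1, "C": 0.3}

def distill_loss(teacher_probs, student_probs):
    """Cross-entropy between teacher and student: -sum_y Pr^t(y) * log Pr^s(y)."""
    return -sum(t * math.log(student_probs[y]) for y, t in teacher_probs.items())

loss1 = distill_loss(teacher, student1)
loss2 = distill_loss(teacher, student2)
print(f"Student 1 loss: {loss1:.4f}")  # approx. 0.704
print(f"Student 2 loss: {loss2:.4f}")  # approx. 0.814
```

Student 1 achieves the lower loss: both students agree on A, but Student 1 puts its remaining mass on B, which the teacher weights three times more heavily than C, so the heavily weighted log-probability terms are penalized less.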