Learn Before
A student model is trained to mimic a teacher model by minimizing the following loss function, which measures the dissimilarity between their output probability distributions for a given input:

Loss = Σ_y Pr^t(y) · log( Pr^t(y) / Pr^s(y) )

In this formula, Pr^t(y) is the teacher's probability for an output sequence y, Pr^s(y) is the student's probability, and the summation is over all possible output sequences y. What is the primary function of the summation (Σ_y) over the entire space of possible outputs?
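As a minimal sketch (assuming the loss is the KL divergence between the teacher's and student's output distributions, and using a toy vocabulary small enough to enumerate every output sequence), the summation can be written out as a literal loop; the same loop is what becomes infeasible when the output space grows as |V|^n:

```python
import itertools
import math

def distillation_loss(teacher, student, outputs):
    """Sum over ALL outputs y of Pr^t(y) * log(Pr^t(y) / Pr^s(y))."""
    return sum(teacher(y) * math.log(teacher(y) / student(y)) for y in outputs)

# Toy setting: a 3-token vocabulary and sequences of length 4,
# so the full output space is enumerable (3^4 = 81 sequences).
vocab = ["a", "b", "c"]
length = 4
outputs = list(itertools.product(vocab, repeat=length))

# Hypothetical uniform distributions, just to exercise the summation.
def teacher(y):
    return 1.0 / len(outputs)

def student(y):
    return 1.0 / len(outputs)

print(len(outputs))                                  # 81 terms in the sum
print(distillation_loss(teacher, student, outputs))  # 0.0 for identical distributions
# A realistic vocabulary (~50,000 tokens) and sequence length (~100) would
# require 50000**100 terms -- summing over the full space is infeasible.
```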
Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Computational Infeasibility of Full Output Summation in Distillation Loss
A student model is trained to mimic a teacher model by minimizing the following loss function, which measures the dissimilarity between their output probability distributions for a given input:

Loss = Σ_y Pr^t(y) · log( Pr^t(y) / Pr^s(y) )

In this formula, Pr^t(y) is the teacher's probability for an output sequence y, Pr^s(y) is the student's probability, and the summation is over all possible output sequences y. What is the primary function of the summation (Σ_y) over the entire space of possible outputs?
Evaluating a Loss Function for a Machine Translation Task
A student model is being trained to replicate the output distribution of a teacher model using the loss function:

Loss = Σ_y Pr^t(y) · log( Pr^t(y) / Pr^s(y) )
Suppose for a given input, there are only three possible output sequences: A, B, and C. The teacher model assigns the following probabilities:
Pr^t(A) = 0.8, Pr^t(B) = 0.15, Pr^t(C) = 0.05
Two different student models produce the following distributions:
- Student 1: Pr^s(A) = 0.6, Pr^s(B) = 0.3, Pr^s(C) = 0.1
- Student 2: Pr^s(A) = 0.6, Pr^s(B) = 0.1, Pr^s(C) = 0.3
Without calculating the exact loss, which student model will achieve a lower loss value, and why?
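The qualitative comparison can be checked numerically (a sketch assuming the KL-divergence form of the loss, with the probabilities given above):

```python
import math

# Teacher and student distributions over the three possible outputs.
teacher = {"A": 0.8, "B": 0.15, "C": 0.05}
student1 = {"A": 0.6, "B": 0.3, "C": 0.1}
student2 = {"A": 0.6, "B": 0.1, "C": 0.3}

def loss(t, s):
    # Sum over all outputs y of Pr^t(y) * log(Pr^t(y) / Pr^s(y)).
    return sum(t[y] * math.log(t[y] / s[y]) for y in t)

l1 = loss(teacher, student1)
l2 = loss(teacher, student2)
print(f"Student 1 loss: {l1:.4f}")  # ~0.0915
print(f"Student 2 loss: {l2:.4f}")  # ~0.2014
# Student 1 achieves the lower loss: both students agree on A, but
# Student 1's ordering of B over C matches the teacher's, while
# Student 2 inverts it.
```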