Cross-Entropy Loss for Knowledge Distillation
A frequently used loss function in knowledge distillation is the sequence-level loss, which often takes the form of cross-entropy. This loss measures the dissimilarity between the teacher model's output distribution, Pr^t(y|x), and the student model's distribution, Pr^s(y|x). The total loss is the negative sum, over all possible output sequences y, of the teacher's probability for a sequence multiplied by the log of the student's probability for that sequence. The formula is expressed as:

Loss_CE = -Σ_{y ∈ Y} Pr^t(y|x) · log Pr^s(y|x)

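As a concrete sketch of this cross-entropy, here is a small Python example over a toy output space of three sequences; the probability values are made up for illustration:

```python
import math

# Hypothetical teacher and student distributions over a toy output
# space of three candidate sequences (values chosen for illustration).
teacher = {"y1": 0.7, "y2": 0.2, "y3": 0.1}
student = {"y1": 0.5, "y2": 0.3, "y3": 0.2}

def distillation_cross_entropy(pr_t, pr_s):
    """Cross-entropy between teacher and student:
    -sum over all outputs y of Pr^t(y) * log Pr^s(y)."""
    return -sum(pr_t[y] * math.log(pr_s[y]) for y in pr_t)

loss = distillation_cross_entropy(teacher, student)
```

The loss is smallest when the student's distribution matches the teacher's exactly, in which case it reduces to the teacher's entropy.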
Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Cross-Entropy Loss for Knowledge Distillation
Using KL Divergence for Knowledge Distillation Loss
A research team is training a small, efficient 'student' model to replicate the behavior of a large, powerful 'teacher' model. The team's goal is to find the optimal parameters for the student model (θ) by minimizing a loss function over a dataset of simplified inputs (D̃), as defined by the following objective:

θ̂ = argmin_θ Σ_{x ∈ D̃} Loss(Pr^t(·|x), Pr^s_θ(·|x))

Where Pr^t is the teacher's output probability distribution and Pr^s_θ is the student's.
If the team mistakenly configures the training process to use the teacher's original, complex dataset D instead of the intended simplified dataset D̃, which of the following outcomes is the most direct and likely consequence for the student model?
Critique of a Modified Training Objective
Diagnosing a Knowledge Distillation Training Issue
Loss Function for RNN
Sample-wise Negative Log-Likelihood Loss for a Sub-sequence
Cross-Entropy Loss for Knowledge Distillation
A language model is being trained to generate the four-word sentence 'The quick brown fox'. The model generates one word at a time, and the error (loss) is calculated at each step:
- Loss for 'The' = 0.1
- Loss for 'quick' = 0.3
- Loss for 'brown' = 0.2
- Loss for 'fox' = 0.4
To update the model's parameters, the training process computes a single, overall loss value for the entire sentence. Which statement best analyzes this method of calculating the overall loss?
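A minimal sketch of the aggregation being asked about, using the per-token losses above: the sequence-level loss is the sum of the per-token losses, and some setups instead average over tokens to normalize for sequence length:

```python
# Per-token losses from the example above.
token_losses = {"The": 0.1, "quick": 0.3, "brown": 0.2, "fox": 0.4}

# Sum of per-token losses gives the overall sequence loss;
# dividing by the token count gives a length-normalized variant.
total_loss = sum(token_losses.values())
mean_loss = total_loss / len(token_losses)
```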
Total Loss Calculation for a Token Sequence
Calculating Average Sequence-Level Loss
Evaluating Training Strategies for a Translation Model
KL Divergence Loss for Knowledge Distillation
Cross-Entropy Loss for Knowledge Distillation
A large, complex language model is used to generate target probabilities for training a smaller, more efficient model. For the input sentence 'The cat sat on the ___', the large model could produce different probability distributions for the next word. Which of the following distributions, representing Pr^t, would provide the most informative and nuanced training signal for the smaller model?
Value of the Teacher's Probability Distribution
In a knowledge distillation process for a machine translation task, a large 'teacher' model translates the sentence 'Je suis content' from French to English. Instead of just outputting 'I am happy', the teacher model produces a full probability distribution over the entire English vocabulary for the next words. Which statement best analyzes the significance of this probability distribution (Pr^t) for training the smaller 'student' model?
Learn After
Computational Infeasibility of Full Output Summation in Distillation Loss
A student model is trained to mimic a teacher model by minimizing the following loss function, which measures the dissimilarity between their output probability distributions for a given input:

Loss = -Σ_{y ∈ Y} Pr^t(y|x) · log Pr^s(y|x)

In this formula, Pr^t(y|x) is the teacher's probability for an output sequence y, Pr^s(y|x) is the student's, and the summation is over all possible output sequences. What is the primary function of the summation (Σ_{y ∈ Y}) over the entire space of possible outputs?
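To see why summing over every possible output sequence is computationally infeasible in practice, consider how the number of candidate sequences grows with vocabulary size and length (the sizes below are illustrative assumptions, not values from the source):

```python
# Number of distinct output sequences of length n over a vocabulary
# of size |V| is |V|**n, which explodes even for modest settings.
vocab_size = 32_000   # assumed subword vocabulary size
max_length = 20       # assumed sequence length

num_sequences = vocab_size ** max_length
```

Even these modest settings yield more candidate sequences than there are atoms in the observable universe, which is why practical distillation approximates the sum (e.g., with sampled or per-token objectives) rather than enumerating it.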
Evaluating a Loss Function for a Machine Translation Task
A student model is being trained to replicate the output distribution of a teacher model using the loss function:

Loss = -Σ_y Pr^t(y) · log Pr^s(y)
Suppose for a given input, there are only three possible output sequences: A, B, and C. The teacher model assigns the following probabilities:
Pr^t(A) = 0.8, Pr^t(B) = 0.15, Pr^t(C) = 0.05
Two different student models produce the following distributions:
- Student 1: Pr^s(A) = 0.6, Pr^s(B) = 0.3, Pr^s(C) = 0.1
- Student 2: Pr^s(A) = 0.6, Pr^s(B) = 0.1, Pr^s(C) = 0.3
Without calculating the exact loss, which student model will achieve a lower loss value, and why?
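As a numerical check of the intuition (a sketch using natural logarithms; the helper function is illustrative, not from the source), the two students' losses under the cross-entropy above can be computed directly:

```python
import math

teacher  = {"A": 0.8, "B": 0.15, "C": 0.05}
student1 = {"A": 0.6, "B": 0.3,  "C": 0.1}
student2 = {"A": 0.6, "B": 0.1,  "C": 0.3}

def cross_entropy(pr_t, pr_s):
    # -sum over y of Pr^t(y) * log Pr^s(y)
    return -sum(pr_t[y] * math.log(pr_s[y]) for y in pr_t)

loss1 = cross_entropy(teacher, student1)  # ~0.704
loss2 = cross_entropy(teacher, student2)  # ~0.814
```

Student 1 achieves the lower loss: both students assign 0.6 to A, but Student 1 puts its remaining mass where the teacher's probability is higher (B over C), so the teacher-weighted log terms are less negative.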