Learn Before
Objective Function for Student Model Training via Knowledge Distillation
Sequence-Level Loss
Definition of Teacher's Probability Distribution (Pt) in Knowledge Distillation
Cross-Entropy Loss for Knowledge Distillation
A frequently used loss function in knowledge distillation is the sequence-level loss, which often takes the form of cross-entropy. This loss measures the dissimilarity between the teacher model's output distribution, $p_t(\cdot \mid x)$, and the student model's distribution, $p_s(\cdot \mid x)$. The total loss is the negative sum, over all possible output sequences $\mathbf{y}$, of the log probability the student assigns to each sequence, weighted by the teacher's probability for that sequence. The formula is expressed as:

$$\mathrm{Loss}_{\mathrm{seq}}(x) = -\sum_{\mathbf{y}} p_t(\mathbf{y} \mid x)\, \log p_s(\mathbf{y} \mid x)$$
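As a minimal sketch (not part of the course materials), the loss can be computed directly when the output space is small enough to enumerate; the sequences and probabilities below are invented for illustration:

```python
import math

# Toy output space: each candidate sequence y with its probability under
# the teacher, p_t(y|x), and under the student, p_s(y|x). All values invented.
teacher_probs = {"a b": 0.6, "a c": 0.3, "b c": 0.1}  # p_t(y|x)
student_probs = {"a b": 0.5, "a c": 0.2, "b c": 0.3}  # p_s(y|x)

# Loss_seq(x) = -sum_y p_t(y|x) * log p_s(y|x)
loss = -sum(p_t * math.log(student_probs[y]) for y, p_t in teacher_probs.items())
print(f"sequence-level cross-entropy loss: {loss:.4f}")
```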
Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Cross-Entropy Loss for Knowledge Distillation
Using KL Divergence for Knowledge Distillation Loss
A research team is training a small, efficient 'student' model to replicate the behavior of a large, powerful 'teacher' model. The team's goal is to find the optimal parameters for the student model ($\hat{\theta}_s$) by minimizing a loss function over a dataset of simplified inputs ($\tilde{D}$), as defined by the following objective:

$$\hat{\theta}_s = \underset{\theta_s}{\operatorname{argmin}} \sum_{x \in \tilde{D}} \mathrm{Loss}\big(p_t(\cdot \mid x),\, p_s(\cdot \mid x; \theta_s)\big)$$

where $p_t(\cdot \mid x)$ is the teacher's output probability distribution and $p_s(\cdot \mid x; \theta_s)$ is the student's.
If the team mistakenly configures the training process to use the teacher's original, complex dataset $D$ instead of the intended simplified dataset $\tilde{D}$, which of the following outcomes is the most direct and likely consequence for the student model?
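A hedged sketch of this objective in Python, where `teacher`, `student`, and the dataset are illustrative placeholders rather than anything defined in the course; the point to notice is that the outer sum ranges over whichever dataset is passed in:

```python
import math

def kd_objective(dataset, teacher, student):
    """Sum the distillation loss over every input x in the chosen dataset.

    `teacher(x)` and `student(x)` are assumed (for illustration only) to
    return dicts mapping candidate output sequences y to probabilities.
    """
    total = 0.0
    for x in dataset:
        p_t = teacher(x)   # teacher's distribution p_t(.|x)
        p_s = student(x)   # student's distribution p_s(.|x; theta_s)
        total += -sum(p * math.log(p_s[y]) for y, p in p_t.items())
    return total

# Training on the wrong dataset simply means passing the original dataset D
# instead of the intended simplified dataset D-tilde to this function.
```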
Loss Function for RNN
Sample-wise Negative Log-Likelihood Loss for a Sub-sequence
Cross-Entropy Loss for Knowledge Distillation
A language model is being trained to generate the four-word sentence 'The quick brown fox'. The model generates one word at a time, and the error (loss) is calculated at each step:
- Loss for 'The' = 0.1
- Loss for 'quick' = 0.3
- Loss for 'brown' = 0.2
- Loss for 'fox' = 0.4
To update the model's parameters, the training process computes a single, overall loss value for the entire sentence. Which statement best analyzes this method of calculating the overall loss?
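For reference, the standard way to form that single value is to sum (and often average) the per-token losses; a quick check with the numbers above:

```python
# Per-token losses for 'The quick brown fox', taken from the example above.
token_losses = [0.1, 0.3, 0.2, 0.4]
total_loss = sum(token_losses)                 # 1.0  (sequence-level loss)
average_loss = total_loss / len(token_losses)  # 0.25 (per-token average)
print(total_loss, average_loss)
```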
Total Loss Calculation for a Token Sequence
Calculating Average Sequence-Level Loss
KL Divergence Loss for Knowledge Distillation
Cross-Entropy Loss for Knowledge Distillation
A large, complex language model is used to generate target probabilities for training a smaller, more efficient model. For the input sentence 'The cat sat on the ___', the large model could produce different probability distributions for the next word. Which of the following distributions, representing $p_t(\cdot \mid x)$, would provide the most informative and nuanced training signal for the smaller model?
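As a rough, invented illustration of what "informative and nuanced" means here, a soft distribution over plausible next words carries more information than a one-hot target; the words and probabilities below are made up:

```python
import math

# Two possible teacher distributions p_t for the next word (values invented).
one_hot = {"mat": 1.0, "rug": 0.0, "sofa": 0.0, "floor": 0.0}
soft = {"mat": 0.55, "rug": 0.20, "sofa": 0.15, "floor": 0.10}

def entropy(dist):
    # Shannon entropy in nats; higher means more information about alternatives.
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

print(entropy(one_hot))  # 0.0   -> only identifies a single target word
print(entropy(soft))     # ~1.17 -> also ranks the plausible alternatives
```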
Learn After
Computational Infeasibility of Full Output Summation in Distillation Loss
A student model is trained to mimic a teacher model by minimizing the following loss function, which measures the dissimilarity between their output probability distributions for a given input:

$$\mathrm{Loss} = -\sum_{\mathbf{y}} p_t(\mathbf{y} \mid x)\, \log p_s(\mathbf{y} \mid x)$$

In this formula, $p_t(\mathbf{y} \mid x)$ is the teacher's probability for an output sequence $\mathbf{y}$, $p_s(\mathbf{y} \mid x)$ is the student's, and the summation is over all possible output sequences. What is the primary function of the summation ($\sum_{\mathbf{y}}$) over the entire space of possible outputs?
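A back-of-the-envelope sketch (with illustrative numbers, not from the course) of why that summation cannot be carried out exactly in practice:

```python
# With vocabulary size V and sequence length up to T, the number of candidate
# output sequences grows roughly as V**T, which is far too many to enumerate.
vocab_size = 32_000   # illustrative
max_length = 20       # illustrative
num_sequences = vocab_size ** max_length
print(f"on the order of {num_sequences:.3e} sequences")
```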