Combined Training Objective Formula for Knowledge Distillation
To incorporate knowledge distillation into language modeling, the distillation loss can be added to the original language modeling loss as an auxiliary term, yielding a combined training objective. This is formulated as a maximization problem to find the optimal parameters of the student (the smaller model), θ_student:

θ̂_student = argmax_{θ_student} Σ_{(x,y)∈D} [ (1 − λ) · log Pr(y | x; θ_student) − λ · Loss_KD(x; θ_teacher, θ_student) ]

In this formula, D is the set of input–output pairs (x, y), and λ ∈ [0, 1] is the interpolation coefficient balancing the two terms. The first term, log Pr(y | x; θ_student), is the log-likelihood of the ground-truth data, encouraging the model to learn from actual labels. The second term, Loss_KD, is the knowledge distillation loss that pushes the student model to mimic the larger teacher model; it enters with a minus sign because it is a loss inside a maximization. This method can be employed in either the pre-training or fine-tuning phase.
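As a minimal sketch of this objective (the function name and the per-token, NumPy-only setup are illustrative assumptions, not from the source), the equivalent minimization form for a single token is (1 − λ) · NLL on the ground-truth label plus λ · KL(teacher ‖ student):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def combined_kd_loss(student_logits, teacher_logits, target_idx, lam):
    """Per-token combined loss: (1 - lam) * NLL + lam * KL(teacher || student).

    Minimizing this matches maximizing the combined objective in the text
    (up to a constant teacher-entropy term inside the KL).
    Hypothetical helper for illustration; real training would batch this
    over sequences in an autodiff framework.
    """
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    nll = -np.log(p_student[target_idx])                       # ground-truth term
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))  # KD term
    return (1.0 - lam) * nll + lam * kl
```

Setting λ = 0 recovers pure language modeling on the labels, λ = 1 recovers pure distillation toward the teacher, and intermediate values interpolate linearly between the two.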

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Dynamic Adjustment of the Knowledge Distillation Coefficient (λ)
Optimizing Student Model Training
When training a smaller 'student' model using a combined objective that learns from both a larger 'teacher' model and the ground-truth data, what is the primary role of the component that learns directly from the ground-truth data?
A student model is being trained using a combined objective that incorporates learning from both a larger 'teacher' model and the ground-truth data. Match each learning source with its primary contribution to the student model's training process.
Learn After
An engineer is training a small 'student' model by learning from a larger 'teacher' model. The training objective is to find the student parameters (θ) that maximize a combined score, formulated as (1 − λ) · Term A + λ · Term B, where 'Term A' measures how well the student predicts the correct, ground-truth answers, and 'Term B' measures how closely the student's outputs match the teacher's outputs. After training, the engineer notices that the student model is replicating systematic errors present in the teacher model, leading to poor performance on a validation set. Which adjustment to the hyperparameter λ is the most appropriate first step to address this issue?
Analyzing the Knowledge Distillation Hyperparameter
A machine learning team is using a combined objective to train a small 'student' model. The goal is to find the student model's parameters (θ) that maximize the following expression: Σ_{(x,y)∈D} [ (1 − λ) · log Pr(y | x; θ) − λ · Loss_KD ]. The first term, log Pr(y | x; θ), measures how well the student predicts the ground-truth labels y. The second term, Loss_KD, measures the difference between the student's and a larger 'teacher' model's predictions. The team is working with a dataset where the ground-truth labels are known to be somewhat noisy and to contain occasional errors; however, the large teacher model has been shown to provide very reliable, well-generalized predictions. Given this situation, how should the team adjust the hyperparameter λ to optimize the student model's performance?