Learn Before
Adjusting the Distillation Loss Coefficient
In the combined training objective for large language models, given a dataset of input-output pairs (denoted as D), the interpolation coefficient (denoted as λ) determines how much influence a smaller model has on the training process. The coefficient can be adjusted during either the pre-training or fine-tuning phase. A common strategy is to gradually decrease λ, which reduces the reliance on the small model's knowledge distillation loss and places more emphasis on the original language modeling loss as the larger model becomes more capable.
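As a rough illustration, here is a minimal sketch of how such a schedule might be wired up, assuming the combined objective has the form Loss = λ · Loss_KD + (1 − λ) · Loss_LM and using PyTorch. The function names (combined_loss, linear_lambda_schedule) and the linear decay are illustrative choices, not specifics from the course.

```python
import torch
import torch.nn.functional as F

def combined_loss(model_logits, small_model_logits, target_ids, lam, temperature=1.0):
    """Interpolate between the original LM loss and the distillation loss.

    lam close to 1 -> lean heavily on the small model's soft targets;
    lam close to 0 -> lean mostly on the ground-truth next-token loss.
    """
    vocab = model_logits.size(-1)

    # Original language-modeling loss: cross-entropy against the ground-truth tokens.
    lm_loss = F.cross_entropy(model_logits.view(-1, vocab), target_ids.view(-1))

    # Knowledge distillation loss: KL divergence from the small model's
    # (temperature-softened) distribution to the larger model's distribution.
    log_p_model = F.log_softmax(model_logits / temperature, dim=-1)
    p_small = F.softmax(small_model_logits / temperature, dim=-1)
    kd_loss = F.kl_div(log_p_model, p_small, reduction="batchmean") * temperature**2

    return lam * kd_loss + (1.0 - lam) * lm_loss

def linear_lambda_schedule(step, total_steps, lam_start=1.0, lam_end=0.0):
    """Gradually decrease lambda so the larger model relies less on the small one."""
    frac = min(step / max(total_steps, 1), 1.0)
    return lam_start + (lam_end - lam_start) * frac
```

At each training step the scheduled value would simply be passed in, e.g. loss = combined_loss(logits, small_logits, targets, lam=linear_lambda_schedule(step, total_steps)).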
Tags
Foundations of Large Language Models
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Combined Training Objective for Knowledge Distillation
In a model training setup, a smaller 'student' model is trained to mimic the output probability distribution of a larger 'teacher' model for a given input. The training objective is to minimize the Kullback-Leibler (KL) divergence between the two distributions. The standard loss function is defined as L_KD = KL(p_teacher || p_student). A researcher proposes an alternative loss function, L'_KD = KL(p_student || p_teacher). How would minimizing L'_KD instead of L_KD most likely change the student model's behavior?
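Assuming the two losses in this question contrast the forward KL, KL(p_teacher || p_student), with the reverse KL, KL(p_student || p_teacher), which is the usual reading of this setup, a small numeric sketch shows how the direction of the divergence changes what the student is penalized for. The toy distributions below are made up for illustration.

```python
import torch

# Toy example: a bimodal teacher and a student that has collapsed onto one mode.
p_teacher = torch.tensor([0.45, 0.05, 0.45, 0.05])
p_student = torch.tensor([0.70, 0.10, 0.10, 0.10])

def kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return torch.sum(p * (p.log() - q.log())).item()

# Forward KL (the standard KD loss) penalizes the student wherever the teacher
# puts probability mass that the student misses -> mode-covering pressure.
forward_kl = kl(p_teacher, p_student)

# Reverse KL (the alternative loss) penalizes the student for putting mass
# where the teacher has little -> mode-seeking pressure.
reverse_kl = kl(p_student, p_teacher)

print(f"KL(teacher || student) = {forward_kl:.4f}")
print(f"KL(student || teacher) = {reverse_kl:.4f}")
```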
Evaluating Student Model Performance
In a knowledge distillation process, a 'teacher' model produces a probability distribution of [0.8, 0.1, 0.1] over three classes for a given input. Four different 'student' models are being evaluated on the same input, producing the distributions below. Which student model's output distribution is being most effectively guided by the teacher, as measured by the standard Kullback-Leibler (KL) divergence loss function?
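The four candidate distributions are not reproduced above, but the comparison itself is a direct calculation: the student whose output yields the smallest KL(teacher || student) is the one being guided most effectively. A minimal sketch, with placeholder student outputs rather than the actual answer options:

```python
import math

teacher = [0.8, 0.1, 0.1]

# Placeholder candidates for illustration only; substitute the four
# student distributions from the question.
students = {
    "Student A": [0.7, 0.2, 0.1],
    "Student B": [0.4, 0.3, 0.3],
}

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), with terms where p_i = 0 dropped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Lower KL(teacher || student) means the student tracks the teacher more closely.
for name, dist in sorted(students.items(), key=lambda kv: kl_divergence(teacher, kv[1])):
    print(f"{name}: KL = {kl_divergence(teacher, dist):.4f}")
```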