Learn Before
In a model training setup, a smaller 'student' model is trained to mimic the output probability distribution of a larger 'teacher' model for a given input. The training objective is to minimize the Kullback-Leibler (KL) divergence between the two distributions; the standard loss function is denoted L. A researcher proposes an alternative loss function, L'. How would minimizing L' instead of L most likely change the student model's behavior?
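For concreteness, a minimal sketch of how a KL-based distillation loss could be computed is shown below. It assumes the standard loss L is the forward KL divergence D_KL(P_teacher || P_student), computed from the two models' output distributions for one input; the distribution values and function names here are hypothetical illustrations, not part of the original question.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical teacher and student outputs for a single input.
p_teacher = [0.7, 0.2, 0.1]
p_student = [0.5, 0.3, 0.2]

# Assumed standard (forward) KL distillation loss: D_KL(teacher || student).
loss = kl_divergence(p_teacher, p_student)
print(f"D_KL(teacher || student) = {loss:.4f}")
```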
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Combined Training Objective for Knowledge Distillation
Evaluating Student Model Performance
In a knowledge distillation process, a 'teacher' model produces a probability distribution of [0.8, 0.1, 0.1] over three classes for a given input. Four different 'student' models are being evaluated on the same input, producing the distributions below. Which student model's output distribution is being most effectively guided by the teacher, as measured by the standard Kullback-Leibler (KL) divergence loss function?
Adjusting the Distillation Loss Coefficient
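A sketch of the comparison behind the "Evaluating Student Model Performance" question above: each candidate student distribution is scored with D_KL(teacher || student) and the lowest divergence wins. The teacher distribution [0.8, 0.1, 0.1] comes from the question; the four student distributions below are hypothetical stand-ins, since the original answer options are not reproduced here.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

teacher = [0.8, 0.1, 0.1]

# Hypothetical candidate student distributions (the real options are not shown above).
students = {
    "A": [0.75, 0.15, 0.10],
    "B": [0.60, 0.20, 0.20],
    "C": [0.34, 0.33, 0.33],
    "D": [0.10, 0.80, 0.10],
}

# Score each student and rank: the smallest KL divergence is the most effectively guided.
scores = {name: kl_divergence(teacher, dist) for name, dist in students.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"student {name}: D_KL(teacher || student) = {score:.4f}")
best = min(scores, key=scores.get)
print(f"Most effectively guided (lowest KL): student {best}")
```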