Evaluating Student Model Performance
A large 'teacher' model and two smaller 'student' models (A and B) are given the same input. Their task is to predict the next word from a vocabulary of three words: {apple, banana, cherry}. The training objective is to minimize the divergence of each student's distribution (Q) from the teacher's distribution (P). The models produce the following probability distributions for the next word.
Teacher Model Distribution (P):
- P(apple) = 0.7
- P(banana) = 0.2
- P(cherry) = 0.1
Student Model A Distribution (Q_A):
- Q_A(apple) = 0.6
- Q_A(banana) = 0.3
- Q_A(cherry) = 0.1
Student Model B Distribution (Q_B):
- Q_B(apple) = 0.8
- Q_B(banana) = 0.1
- Q_B(cherry) = 0.1
Using the formula for the loss, Loss = Σ P(x) * log(P(x) / Q(x)), calculate the loss for both Student A and Student B. Based on your calculations, which student model is more effectively mimicking the teacher model for this specific input? Explain your reasoning. (Use the natural logarithm, ln, for your calculations).
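To check a hand calculation, the loss formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original exercise: the function name `kl_divergence` and the dictionary representation of the distributions are assumptions made here for convenience.

```python
import math


def kl_divergence(p, q):
    """Loss = sum over x of P(x) * ln(P(x) / Q(x)), using the natural log.

    `p` and `q` are dicts mapping each word to its probability.
    Terms with P(x) == 0 contribute nothing and are skipped.
    """
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)


# Distributions copied from the problem statement.
teacher = {"apple": 0.7, "banana": 0.2, "cherry": 0.1}
student_a = {"apple": 0.6, "banana": 0.3, "cherry": 0.1}
student_b = {"apple": 0.8, "banana": 0.1, "cherry": 0.1}

loss_a = kl_divergence(teacher, student_a)
loss_b = kl_divergence(teacher, student_b)

print(f"Loss for Student A: {loss_a:.4f}")
print(f"Loss for Student B: {loss_b:.4f}")
```

The student with the smaller loss is closer to the teacher on this input; a loss of 0 would mean the two distributions are identical.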