Critique of a Modified Training Objective
A common method for training a small 'student' model involves finding its optimal parameters, $\hat{\theta}$, by minimizing a loss function that compares its output distribution, $p_{\theta}(\cdot \mid x)$, to a larger 'teacher' model's distribution, $p_t(\cdot \mid x)$, over a dataset $D$. The objective is:

$$\hat{\theta} = \arg\min_{\theta} \sum_{x \in D} \mathrm{KL}\big(p_t(\cdot \mid x) \,\|\, p_{\theta}(\cdot \mid x)\big)$$

A researcher proposes simplifying this process by replacing the teacher's full probability distribution, $p_t(\cdot \mid x)$, with only its single most probable output (i.e., a one-hot vector corresponding to the teacher's top prediction). Critically evaluate this modification. What specific type of information from the teacher would the student model fail to learn, and why is this information valuable?
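The contrast can be made concrete with a small numerical sketch. The distributions below are hypothetical, and the `kl_divergence` helper is introduced only for illustration; the point is that a one-hot target discards the teacher's relative probabilities over the non-top classes.

```python
import numpy as np

def kl_divergence(p_teacher, p_student, eps=1e-12):
    """KL(p_teacher || p_student) between two categorical distributions."""
    p = np.asarray(p_teacher) + eps
    q = np.asarray(p_student) + eps
    return float(np.sum(p * np.log(p / q)))

# Hypothetical teacher distribution over 4 classes for one input.
# The mass it spreads over the runner-up classes (0.30 vs 0.05) encodes
# which classes the teacher considers similar to the top one.
p_teacher = np.array([0.60, 0.30, 0.05, 0.05])

# One-hot target keeping only the teacher's top prediction.
p_onehot = np.array([1.0, 0.0, 0.0, 0.0])

# A student that has learned those class similarities.
p_student = np.array([0.55, 0.33, 0.06, 0.06])

# Under the full-distribution objective this student is nearly optimal;
# under the one-hot objective the same student is heavily penalized.
print(kl_divergence(p_teacher, p_student))  # small
print(kl_divergence(p_onehot, p_student))   # much larger
```

The gap between the two losses is exactly the signal lost by the proposed simplification: the teacher's ranking of, and relative confidence in, the non-top classes.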
Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Cross-Entropy Loss for Knowledge Distillation
Using KL Divergence for Knowledge Distillation Loss
A research team is training a small, efficient 'student' model to replicate the behavior of a large, powerful 'teacher' model. The team's goal is to find the optimal parameters for the student model ($\hat{\theta}$) by minimizing a loss function over a dataset of simplified inputs ($\tilde{D}$), as defined by the following objective:

$$\hat{\theta} = \arg\min_{\theta} \sum_{x \in \tilde{D}} \mathrm{KL}\big(p_t(\cdot \mid x) \,\|\, p_{\theta}(\cdot \mid x)\big)$$

where $p_t(\cdot \mid x)$ is the teacher's output probability distribution and $p_{\theta}(\cdot \mid x)$ is the student's.

If the team mistakenly configures the training process to use the teacher's original, complex dataset $D$ instead of the intended simplified dataset $\tilde{D}$, which of the following outcomes is the most direct and likely consequence for the student model?
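A minimal Python sketch of the objective above, assuming (hypothetically) that each dataset is represented by the teacher's and student's per-input logits. It makes visible that the dataset argument controls only which inputs the sum ranges over.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a logit vector."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits):
    """Sum over the dataset of KL(p_t(.|x) || p_theta(.|x)),
    where each list element holds one input's logits."""
    total = 0.0
    for t, s in zip(teacher_logits, student_logits):
        p_t, p_s = softmax(t), softmax(s)
        total += float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))
    return total
```

Because the sum ranges over whichever dataset is supplied, training on $D$ instead of $\tilde{D}$ makes the student match the teacher on the complex inputs rather than on the simplified inputs it was intended to handle.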
Critique of a Modified Training Objective
Diagnosing a Knowledge Distillation Training Issue