Analyzing the Knowledge Distillation Hyperparameter
Consider the combined training objective for a student model in knowledge distillation:

$$\hat{\theta} = \arg\max_{\theta}\; \Big[\, \underbrace{\log \Pr(\mathbf{y} \mid \mathbf{x};\, \theta)}_{\text{Term A}} \;+\; \lambda \cdot \underbrace{\big(-\mathrm{KL}\big(p_{\text{teacher}} \,\|\, p_{\theta}\big)\big)}_{\text{Term B}} \,\Big]$$

Explain the potential negative consequence for the student model's performance if the hyperparameter λ is set to a very high value, and justify your explanation by referencing the two main components of the formula.
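To make the role of λ concrete, here is a minimal PyTorch-style sketch of this kind of objective, assuming Term A is the (negated) cross-entropy against the ground-truth labels and Term B is the (negated) KL divergence to the teacher's output distribution; the function name `distillation_loss` and the argument `lam` are illustrative, not from the source.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, lam):
    """Loss to *minimize*; maximizing Term A + lam * Term B is
    equivalent to minimizing this quantity."""
    # Term A (negated): cross-entropy against the ground-truth labels;
    # minimizing it maximizes the ground-truth log-likelihood.
    term_a = F.cross_entropy(student_logits, labels)
    # Term B (negated): KL divergence from the teacher's distribution
    # to the student's; smaller means the student matches the teacher.
    term_b = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    # As lam grows, the teacher-matching term dominates the gradient
    # and the ground-truth term is effectively ignored.
    return term_a + lam * term_b
```

With a small `lam` (say 0.1), the ground-truth signal dominates; with a very large `lam`, the student optimizes almost exclusively for matching the teacher, inheriting the teacher's biases and systematic errors while the ground-truth term contributes almost nothing to the gradient.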
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An engineer is training a small 'student' model by learning from a larger 'teacher' model. The training objective is to find the student parameters (θ) that maximize a combined score, formulated as:

$$\hat{\theta} = \arg\max_{\theta}\; \big[\, \text{Term A} + \lambda \cdot \text{Term B} \,\big]$$

where 'Term A' $= \log \Pr(\mathbf{y} \mid \mathbf{x};\, \theta)$ measures how well the student predicts the correct, ground-truth answers, and 'Term B' $= -\mathrm{KL}\big(p_{\text{teacher}} \,\|\, p_{\theta}\big)$ measures how closely the student's outputs match the teacher's outputs. After training, the engineer notices the student model is replicating systematic errors present in the teacher model, leading to poor performance on a validation set. Which adjustment to the hyperparameter λ is the most appropriate first step to address this issue?
Analyzing the Knowledge Distillation Hyperparameter
A machine learning team is using a combined objective to train a small 'student' model. The goal is to find the student model's parameters (θ) that maximize the following expression:

$$\hat{\theta} = \arg\max_{\theta}\; \big[\, \log \Pr(\mathbf{y} \mid \mathbf{x};\, \theta) \;-\; \lambda \cdot \mathrm{KL}\big(p_{\text{teacher}} \,\|\, p_{\theta}\big) \,\big]$$

The first term, $\log \Pr(\mathbf{y} \mid \mathbf{x};\, \theta)$, measures how well the student predicts the ground-truth labels $\mathbf{y}$. The second term, $\mathrm{KL}\big(p_{\text{teacher}} \,\|\, p_{\theta}\big)$, measures the difference between the student's and a larger 'teacher' model's predictions. The team is working with a dataset where the ground-truth labels are known to be somewhat noisy and contain occasional errors. However, the large teacher model has been shown to provide very reliable and well-generalized predictions. Given this situation, how should the team adjust the hyperparameter λ to optimize the student model's performance?
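Both related scenarios above turn on the same trade-off between the two terms. As a rough, hypothetical illustration (reusing the `distillation_loss` sketch from earlier, with made-up tensor shapes and λ values):

```python
import torch

s_logits = torch.randn(8, 10)        # student logits: batch of 8, 10 classes
t_logits = torch.randn(8, 10)        # teacher logits for the same batch
labels = torch.randint(0, 10, (8,))  # ground-truth class labels

# Teacher's systematic errors leaking into the student: lower lam so
# the ground-truth term (Term A) regains influence during training.
loss = distillation_loss(s_logits, t_logits, labels, lam=0.1)

# Noisy ground-truth labels but a reliable teacher: raise lam so the
# teacher's well-generalized distribution (Term B) outweighs label noise.
loss = distillation_loss(s_logits, t_logits, labels, lam=5.0)
```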