Learn Before
A machine learning team is developing a compact language model (the 'student') by training it to learn from a much larger, high-performing model (the 'teacher'). They conduct two experiments with identical student model architectures:
- Experiment 1: The student model is trained solely by minimizing the difference between its final output predictions and the teacher model's final output predictions.
- Experiment 2: In addition to matching the final predictions, the student model is also trained to minimize the difference between its own intermediate layer representations and the corresponding intermediate layer representations of the teacher model.
The team observes that the model from Experiment 2 achieves significantly better performance on a diverse set of new, unseen tasks compared to the model from Experiment 1.
Which of the following provides the most accurate analysis of this outcome?
0
1
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A machine learning team is developing a compact language model (the 'student') by training it to learn from a much larger, high-performing model (the 'teacher'). They conduct two experiments with identical student model architectures:
- Experiment 1: The student model is trained solely by minimizing the difference between its final output predictions and the teacher model's final output predictions.
- Experiment 2: In addition to matching the final predictions, the student model is also trained to minimize the difference between its own intermediate layer representations and the corresponding intermediate layer representations of the teacher model.
The team observes that the model from Experiment 2 achieves significantly better performance on a diverse set of new, unseen tasks compared to the model from Experiment 1.
Which of the following provides the most accurate analysis of this outcome?
When using a large 'teacher' model to train a smaller 'student' model, the only way to transfer knowledge is by training the student to replicate the teacher's final output predictions.
Enhancing Knowledge Transfer in Model Distillation