Multiple Choice

A machine learning team is developing a compact language model (the 'student') by training it to learn from a much larger, high-performing model (the 'teacher'). They conduct two experiments with identical student model architectures:

  • Experiment 1: The student model is trained solely by minimizing the difference between its final output predictions and the teacher model's final output predictions.
  • Experiment 2: In addition to matching the final predictions, the student model is also trained to minimize the difference between its own intermediate layer representations and the corresponding intermediate layer representations of the teacher model.
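
To make the two training objectives concrete, here is a minimal NumPy sketch. The question only says "minimizing the difference," so the specific choices below are assumptions for illustration: KL divergence on the output distributions, mean squared error on the intermediate representations, a hypothetical projection matrix `proj` to map the student's hidden size onto the teacher's, and a hypothetical weight `alpha` on the intermediate-layer term.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def output_loss(student_logits, teacher_logits):
    # Assumed choice: KL(teacher || student) averaged over examples.
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)))

def hidden_loss(student_hidden, teacher_hidden, proj):
    # Assumed choice: MSE after projecting the student's representation
    # into the teacher's hidden dimension (proj is hypothetical).
    return float(np.mean((student_hidden @ proj - teacher_hidden) ** 2))

def experiment_1_loss(s_logits, t_logits):
    # Experiment 1: match final output predictions only.
    return output_loss(s_logits, t_logits)

def experiment_2_loss(s_logits, t_logits, s_hidden, t_hidden, proj, alpha=1.0):
    # Experiment 2: output matching plus intermediate-layer matching.
    return output_loss(s_logits, t_logits) + alpha * hidden_loss(s_hidden, t_hidden, proj)
```

In this sketch the Experiment 2 objective is the Experiment 1 objective plus one extra non-negative term, so the only difference between the two setups is the added supervision on intermediate representations.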

The team observes that the model from Experiment 2 achieves significantly better performance on a diverse set of new, unseen tasks compared to the model from Experiment 1.

Which of the following provides the most accurate analysis of this outcome?

Updated 2025-09-26

Tags

Ch.1 Pre-training - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science