Learn Before
Multiple Choice

A machine learning team is developing a compact model for a mobile application. They have a large, highly accurate 'teacher' model and a smaller 'student' model architecture. Instead of training the student model directly on the original dataset with its ground-truth labels (e.g., 'this image is a cat'), they train it to mimic the full output probability distribution of the teacher model (e.g., '90% cat, 5% dog, 1% tiger...'). Why does this technique often yield a better-performing student model than training it from scratch on the original labels?
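The setup described in the question is standard knowledge distillation. As a concrete illustration, here is a minimal PyTorch sketch of a common form of the distillation objective (the function and parameter names are hypothetical, and the blend of losses shown is one typical choice, not the only one): the student is trained against the teacher's temperature-softened distribution, optionally mixed with the ordinary hard-label cross-entropy.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Hypothetical distillation objective: a KL term that pushes the
    student's softened distribution toward the teacher's, blended with
    the usual hard-label cross-entropy."""
    # Soften both distributions with a temperature > 1 so the teacher's
    # small probabilities ('5% dog, 1% tiger') remain visible as a signal.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between the softened distributions; the T^2 factor
    # keeps gradient magnitudes comparable across temperature settings.
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the ground-truth hard labels.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kd + (1 - alpha) * ce

# Example usage: a batch of 8 examples over 10 classes.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```

The temperature exaggerates the teacher's small off-target probabilities, which encode how similar classes are to one another; that extra inter-class structure is precisely the signal the question is probing.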



Tags

Deep Learning (in Machine Learning)

Data Science

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science