Learn Before
Multiple Choice

A machine learning team is developing a compact model for a mobile application. They have a large, highly accurate 'teacher' model and a smaller 'student' model architecture. Instead of training the student model directly on the original dataset with its ground-truth labels (e.g., 'this image is a cat'), they train it to mimic the full output probability distribution of the teacher model (e.g., '90% cat, 5% dog, 1% tiger...'). Why does this technique often yield a better-performing student model than training it from scratch on the original labels?
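The setup described in the question is standard knowledge distillation. As a concrete illustration, here is a minimal PyTorch sketch of a common form of the distillation objective (the function and parameter names are hypothetical, and the blend of losses shown is one typical choice, not the only one): the student is trained against the teacher's temperature-softened distribution, optionally mixed with the ordinary hard-label cross-entropy.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Hypothetical distillation objective: a KL term that pushes the
    student's softened distribution toward the teacher's, blended with
    the usual hard-label cross-entropy."""
    # Soften both distributions with a temperature > 1 so the teacher's
    # small probabilities ('5% dog, 1% tiger') remain visible as a signal.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between the softened distributions; the T^2 factor
    # keeps gradient magnitudes comparable across temperature settings.
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the ground-truth hard labels.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kd + (1 - alpha) * ce

# Example usage: a batch of 8 examples over 10 classes.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```

The temperature exaggerates the teacher's small off-target probabilities, which encode how similar classes are to one another; that extra inter-class structure is precisely the signal the question is probing.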



Tags

Deep Learning (in Machine Learning)

Data Science

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science