General Loss Function for Knowledge Distillation
The general loss function for knowledge distillation, often written as L(θ) for simplicity, measures the discrepancy between a teacher and a student model for a given input x. The function is formally expressed as L(θ) = Loss(P_t(x), P_θ^s(x)), where P_t(x) is the probability distribution of the pre-trained teacher model, and P_θ^s(x) is the distribution of the student model with parameters θ. The training objective is to minimize this loss, thereby teaching the student to replicate the teacher's behavior.
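The idea can be sketched in code. This is a minimal illustration, assuming the discrepancy Loss(·,·) is instantiated as the KL divergence KL(P_t || P_θ^s), a common choice; the function name kd_loss and the example distributions are hypothetical.

```python
import math

def kd_loss(teacher_probs, student_probs, eps=1e-12):
    """KL(P_t || P_theta^s): discrepancy between teacher and student
    output distributions over the same vocabulary for one input x.
    eps guards against log(0) for zero-probability entries."""
    return sum(
        pt * math.log((pt + eps) / (ps + eps))
        for pt, ps in zip(teacher_probs, student_probs)
    )

# Teacher's distribution over a 3-token vocabulary for some input x.
teacher = [0.7, 0.2, 0.1]

# A student that matches the teacher closely incurs a small loss;
# a student that disagrees incurs a large one. Minimizing this loss
# over the student's parameters drives its distribution toward the teacher's.
student_close = [0.65, 0.25, 0.10]
student_far = [0.10, 0.20, 0.70]
assert kd_loss(teacher, student_close) < kd_loss(teacher, student_far)
```

Note that the loss is zero exactly when the student's distribution equals the teacher's, which is why a persistently high loss indicates the student has not yet learned to mimic the teacher.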

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.3 Prompting - Foundations of Large Language Models
Related
Distillation Loss for Response-Based Knowledge
Objective Function for Student Model Training via Knowledge Distillation
Definition of Teacher's Probability Distribution (Pt) in Knowledge Distillation
Definition of Student's Probability Distribution (P_theta^s)
Optimizing a Language Model for Mobile Deployment
A research lab has developed a very large and complex language model that achieves state-of-the-art performance on a translation task. However, due to its size, the model is too slow and expensive to deploy for a real-time translation mobile app. To address this, the team uses the large model's predictions on a set of sentences to train a new, much smaller and faster model. What is the primary strategic advantage of this two-model approach?
A development team is using a knowledge distillation framework to create a compact, efficient language model (the 'student') from a much larger, high-performance model (the 'teacher'). The goal is to deploy the student model on devices with limited computational resources. Which statement best analyzes the typical relationship between the inputs processed by the teacher and student models during this process?
Learn After
Deconstructing the Knowledge Transfer Loss Function
An engineer is training a compact 'student' model to replicate the behavior of a larger 'teacher' model. The training process aims to minimize a loss function that measures the difference between the output probability distributions of the two models for any given input. If the loss value remains high throughout the training, what is the most direct conclusion?
Analyzing the Components of a Model Mimicry Loss Function