Knowledge Distillation Loss using KL Divergence
Instead of just using small models to generate synthetic data, one can incorporate a knowledge distillation loss based on these models. The knowledge distillation loss, denoted as $\mathcal{L}_{\text{KD}}$, quantifies the difference between the output probability distributions of a teacher (here, the small) model and a student (here, the large) model. It is formally defined using the Kullback-Leibler (KL) divergence as:

$$\mathcal{L}_{\text{KD}}(\theta) = \mathrm{KL}\big(p_t(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x)\big) = \sum_{y} p_t(y \mid x) \log \frac{p_t(y \mid x)}{p_\theta(y \mid x)}$$

Here, $p_t(\cdot \mid x)$ is the probability distribution produced by the teacher (or weak) model, and $p_\theta(\cdot \mid x)$ is the distribution from the student model with parameters $\theta$, given an input $x$. This loss measures the difference between the small and large models, and minimizing it encourages the large model to mimic the small model's behavior.
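Under this definition, the loss for a single input can be computed directly from the two output distributions. A minimal sketch in plain Python (the function name and the `eps` guard are illustrative, not from the text):

```python
import math

def kl_divergence(p_teacher, p_student, eps=1e-12):
    """Forward KL divergence KL(p_teacher || p_student) for discrete distributions.

    `eps` guards against log(0) when the student assigns zero probability.
    Terms where the teacher probability is zero contribute nothing.
    """
    return sum(
        pt * math.log(pt / max(ps, eps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )

teacher = [0.7, 0.2, 0.1]
# A student that exactly matches the teacher has zero loss.
print(kl_divergence(teacher, teacher))           # 0.0
# Any mismatch yields a strictly positive loss.
print(kl_divergence(teacher, [0.5, 0.3, 0.2]))   # > 0
```

Note the asymmetry: the expectation is taken under the teacher's distribution, so the student is penalized most heavily for assigning low probability to outcomes the teacher considers likely.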

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.3 Prompting - Foundations of Large Language Models
Related
Knowledge Distillation Loss using KL Divergence
KL Divergence Loss for Knowledge Distillation
A compact computational model is being trained to replicate the probabilistic outputs of a large, established reference model. The training process aims to minimize the dissimilarity between the two models' full output distributions for any given input. Below are the output probability distributions from the reference model and three potential outputs from the compact model for the same input.
Reference Model Output:
[0.70, 0.20, 0.10]
Which of the compact model outputs below demonstrates the most successful replication of the reference model's output distribution, considering the goal is to match the entire distribution, not just the most likely outcome?
Compact Model - Output A:
[0.65, 0.22, 0.13]
Compact Model - Output B:
[0.70, 0.10, 0.20]
Compact Model - Output C:
[0.50, 0.30, 0.20]
Rationale for Distribution Matching in Model Training
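One way to settle the question above is to compute the KL divergence from the reference distribution to each candidate; the output with the smallest divergence matches the full distribution best. A quick illustrative check (not part of the original card):

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

reference = [0.70, 0.20, 0.10]
candidates = {
    "A": [0.65, 0.22, 0.13],
    "B": [0.70, 0.10, 0.20],
    "C": [0.50, 0.30, 0.20],
}

# Score each candidate; lower divergence means a closer overall match.
scores = {name: kl(reference, q) for name, q in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # "A"
```

Note that Output B matches the most likely outcome exactly (0.70) yet scores worse than Output A, because it swaps the probabilities of the other two classes; matching the whole distribution matters.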
Knowledge Distillation Loss using KL Divergence
Analyzing Model Training Scenarios
An engineering team is developing a compact, fast model to replicate the predictions of a much larger, more complex model for a 5-category classification task. They use a specific mathematical function to calculate a 'dissimilarity score' between the probability distributions produced by the two models for each input. A lower score indicates the outputs are more similar. After several training epochs, they observe the average dissimilarity score on a validation dataset has significantly decreased. What is the most accurate interpretation of this observation?
A small, efficient model is being trained to emulate the behavior of a large, powerful model on a 3-category classification task. A mathematical function is used to calculate a 'dissimilarity score' between the probability distributions produced by the two models for a given input, where a higher score indicates a greater difference. For which of the following scenarios would this dissimilarity score be the highest?
Knowledge Distillation Loss using KL Divergence
Evaluating Model Mimicry Performance
Learn After
Combined Training Objective for Knowledge Distillation
In a model training setup, a smaller 'student' model is trained to mimic the output probability distribution of a larger 'teacher' model for a given input. The training objective is to minimize the Kullback-Leibler (KL) divergence between the two distributions. The standard loss function is defined as $\mathcal{L}_{\text{KD}} = \mathrm{KL}(p_t \,\|\, p_\theta)$. A researcher proposes an alternative loss function, $\mathcal{L}_{\text{alt}} = \mathrm{KL}(p_\theta \,\|\, p_t)$, which reverses the order of the arguments. How would minimizing $\mathcal{L}_{\text{alt}}$ instead of $\mathcal{L}_{\text{KD}}$ most likely change the student model's behavior?
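The KL divergence is asymmetric, so reversing its arguments changes the training signal. A common contrast (assumed here as the variant the question intends) is forward KL, $\mathrm{KL}(p_t \,\|\, p_\theta)$, versus reverse KL, $\mathrm{KL}(p_\theta \,\|\, p_t)$: the forward direction punishes the student for missing any mode the teacher supports ("mean-seeking"), while the reverse direction tolerates the student concentrating on a single teacher mode ("mode-seeking"). A small numerical illustration with a hypothetical bimodal teacher:

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [0.49, 0.02, 0.49]   # two strong modes
student = [0.90, 0.05, 0.05]   # collapses onto one teacher mode

forward = kl(teacher, student)  # large: the second mode is nearly ignored
reverse = kl(student, teacher)  # smaller: the student stays inside a teacher mode
print(forward, reverse)
```

Here the forward KL is noticeably larger than the reverse KL for the same pair of distributions, which is why a mode-collapsed student is penalized under the standard distillation loss but looks comparatively acceptable under the reversed one.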
Evaluating Student Model Performance
In a knowledge distillation process, a 'teacher' model produces a probability distribution of
[0.8, 0.1, 0.1]
over three classes for a given input. Four different 'student' models are being evaluated on the same input, producing the distributions below. Which student model's output distribution is being most effectively guided by the teacher, as measured by the standard Kullback-Leibler (KL) divergence loss function?
Adjusting the Distillation Loss Coefficient