Learn Before
  • Objective Function for Student Model Training via Knowledge Distillation

Using KL Divergence for Knowledge Distillation Loss

An alternative approach to knowledge distillation loss is to directly minimize the discrepancy between the output probability distributions of the teacher and student models. The Kullback-Leibler (KL) divergence is a common choice for formulating this loss, quantifying how much the student's distribution diverges from the teacher's. Note that KL divergence is not a true distance: it is asymmetric, so KL(teacher ∥ student) generally differs from KL(student ∥ teacher).
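As a concrete sketch of this loss (the distributions below are made-up toy values, not taken from the text), the KL divergence between a teacher distribution and a student distribution over the same vocabulary can be computed directly from its definition:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i); assumes strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical teacher and student output distributions over a 3-token vocabulary.
teacher = [0.70, 0.20, 0.10]
student = [0.60, 0.25, 0.15]

loss = kl_divergence(teacher, student)  # distillation loss for this single input
```

In distillation the "forward" direction KL(teacher ∥ student) is typically used, so the student is penalized most heavily where the teacher assigns high probability; swapping the arguments gives a different value because KL is asymmetric.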

Tags

Ch.3 Prompting - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Related
  • Cross-Entropy Loss for Knowledge Distillation

  • Using KL Divergence for Knowledge Distillation Loss

  • A research team is training a small, efficient 'student' model to replicate the behavior of a large, powerful 'teacher' model. The team's goal is to find the optimal parameters for the student model ($\hat{\theta}$) by minimizing a loss function over a dataset of simplified inputs ($\mathcal{D}'$), as defined by the following objective:

    $$\hat{\theta} = \arg\min_{\theta} \sum_{\mathbf{x}' \in \mathcal{D}'} \text{Loss}(\text{Pr}^t(\cdot|\cdot), \text{Pr}_{\theta}^s(\cdot|\cdot), \mathbf{x}')$$

    where $\text{Pr}^t$ is the teacher's output probability distribution and $\text{Pr}_{\theta}^s$ is the student's.

    If the team mistakenly configures the training process to use the teacher's original, complex dataset instead of the intended simplified dataset $\mathcal{D}'$, which of the following outcomes is the most direct and likely consequence for the student model?

  • Critique of a Modified Training Objective

  • Diagnosing a Knowledge Distillation Training Issue
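As an illustrative sketch of the objective above (the teacher distribution, learning rate, and iteration count here are my own toy assumptions, not from the text), a student distribution can be fit to a single teacher distribution by gradient descent on the KL loss, using the fact that the gradient of KL(teacher ∥ softmax(z)) with respect to the student logits z is simply (student − teacher):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    """KL(p || q) for strictly positive distributions p and q."""
    return float(np.sum(p * np.log(p / q)))

teacher = np.array([0.70, 0.20, 0.10])  # hypothetical teacher output
z = np.zeros(3)                          # student logits, uniform start

for _ in range(500):
    student = softmax(z)
    z -= 0.5 * (student - teacher)       # gradient step on KL(teacher || student)

student = softmax(z)                     # converges toward the teacher distribution
```

In real distillation this update would run over every input in the simplified dataset and backpropagate through the full student network; the single-distribution loop above only illustrates why minimizing the KL term drives the student's distribution toward the teacher's.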

Learn After
  • KL Divergence Loss for Knowledge Distillation

  • A compact computational model is being trained to replicate the probabilistic outputs of a large, established reference model. The training process aims to minimize the dissimilarity between the two models' full output distributions for any given input. Below are the output probability distributions from the reference model and three potential outputs from the compact model for the same input.

    Reference Model Output: [0.70, 0.20, 0.10]

    Which of the compact model outputs below demonstrates the most successful replication of the reference model's output distribution, considering the goal is to match the entire distribution, not just the most likely outcome?

    Compact Model - Output A: [0.65, 0.22, 0.13]

    Compact Model - Output B: [0.70, 0.10, 0.20]

    Compact Model - Output C: [0.50, 0.30, 0.20]

  • Rationale for Distribution Matching in Model Training

  • Knowledge Distillation Loss using KL Divergence

  • Analyzing Model Training Scenarios
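The distribution-matching exercise above (the "Learn After" question with Outputs A, B, and C) can be checked numerically by computing KL(reference ∥ output) for each candidate and picking the smallest:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

reference = [0.70, 0.20, 0.10]
outputs = {
    "A": [0.65, 0.22, 0.13],
    "B": [0.70, 0.10, 0.20],
    "C": [0.50, 0.30, 0.20],
}

scores = {name: kl_divergence(reference, q) for name, q in outputs.items()}
best = min(scores, key=scores.get)  # "A"
```

Output A wins because it stays close to the reference on every component; Output B matches the most likely outcome exactly but swaps the probabilities of the other two, and the KL divergence penalizes that mismatch across the entire distribution.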