Diagnosing a Knowledge Distillation Training Issue
Based on the training objective provided in the scenario below and the observed outcome, what is the most likely cause of the student model's poor performance on complex sentences? Explain your reasoning by referencing the components of the objective function.
Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Cross-Entropy Loss for Knowledge Distillation
Using KL Divergence for Knowledge Distillation Loss
A research team is training a small, efficient 'student' model to replicate the behavior of a large, powerful 'teacher' model. The team's goal is to find the optimal parameters for the student model ($\hat{\theta}_{\text{student}}$) by minimizing a loss function over a dataset of simplified inputs ($\mathcal{D}$), as defined by the following objective:

$$\hat{\theta}_{\text{student}} = \arg\min_{\theta_{\text{student}}} \sum_{x \in \mathcal{D}} \mathrm{Loss}\big(p_{\text{teacher}}(\cdot \mid x),\; p_{\text{student}}(\cdot \mid x; \theta_{\text{student}})\big)$$

where $p_{\text{teacher}}(\cdot \mid x)$ is the teacher's output probability distribution for input $x$ and $p_{\text{student}}(\cdot \mid x; \theta_{\text{student}})$ is the student's.
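To make the objective concrete, here is a minimal PyTorch sketch of one distillation training step, assuming the generic Loss is instantiated as the KL divergence between the two distributions (one common choice, echoed in the related notes above). All function and variable names are illustrative placeholders, not from the original question.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, optimizer, temperature=2.0):
    """One step of minimizing Loss(p_teacher, p_student) on a batch from D.

    Assumes `student` and `teacher` both map a batch of inputs to logits
    of shape (batch_size, vocab_size), and that Loss is KL divergence.
    """
    with torch.no_grad():                # the teacher is fixed; no gradients
        teacher_logits = teacher(batch)
    student_logits = student(batch)

    # p_teacher and p_student from the objective, softened by a temperature
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL(p_teacher || p_student); "batchmean" averages over the batch.
    # (Implementations often also scale by temperature**2 to keep gradient
    # magnitudes comparable across temperatures.)
    loss = F.kl_div(log_p_student, p_teacher, reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that which corpus `batch` is drawn from corresponds exactly to the sum over $x \in \mathcal{D}$ in the objective; that choice is what the question below turns on.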
If the team mistakenly configures the training process to use the teacher's original, complex dataset instead of the intended simplified dataset $\mathcal{D}$, which of the following outcomes is the most direct and likely consequence for the student model?
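For concreteness, the misconfiguration described in the question amounts to swapping which corpus the training loop iterates over, as in this hypothetical sketch (dataset names and shapes are invented for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-ins for the two corpora: D (short, simplified inputs)
# versus the teacher's original corpus of long, complex inputs.
simplified_dataset = TensorDataset(torch.randint(0, 1000, (512, 16)))
complex_dataset = TensorDataset(torch.randint(0, 50000, (100000, 512)))

# Intended configuration: the distillation sum runs over x in D.
train_loader = DataLoader(simplified_dataset, batch_size=32, shuffle=True)

# The mistake in the question: the same objective, but the sum now runs
# over the teacher's original, complex dataset instead of D.
train_loader = DataLoader(complex_dataset, batch_size=32, shuffle=True)
```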