Short Answer

Critique of a Modified Training Objective

A common method for training a small 'student' model involves finding its optimal parameters, $\hat{\theta}$, by minimizing a loss function that compares its output distribution, $\text{Pr}_{\theta}^s$, to a larger 'teacher' model's distribution, $\text{Pr}^t$, over a dataset $\mathcal{D}'$. The objective is:

$$\hat{\theta} = \arg\min_{\theta} \sum_{\mathbf{x}' \in \mathcal{D}'} \text{Loss}(\text{Pr}^t(\cdot|\cdot), \text{Pr}_{\theta}^s(\cdot|\cdot), \mathbf{x}')$$

A researcher proposes simplifying this process by replacing the teacher's full probability distribution, $\text{Pr}^t(\cdot|\cdot)$, with only its single most probable output (i.e., using a one-hot encoded vector corresponding to the teacher's top prediction). Critically evaluate this modification. What specific type of information from the teacher would the student model fail to learn, and why is this information valuable?
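To make the contrast concrete, the sketch below compares the two targets on a toy 3-token vocabulary (the logit values are hypothetical, chosen only for illustration). The teacher's full distribution encodes a near-tie between its top two tokens; the one-hot replacement erases that relative-probability information entirely.

```python
import math

def softmax(logits, temperature=1.0):
    # Convert raw scores into a probability distribution.
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    # KL(p || q): a common choice of Loss(Pr^t, Pr^s, x) in distillation.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical teacher logits over a 3-token vocabulary for one context.
teacher_probs = softmax([4.0, 3.5, 0.1])
print([round(p, 3) for p in teacher_probs])  # → [0.615, 0.373, 0.012]

# One-hot target built from the teacher's argmax, as the researcher
# proposes. The near-tie between tokens 0 and 1 is erased: token 1 now
# looks exactly as wrong as token 2.
top = max(range(len(teacher_probs)), key=lambda j: teacher_probs[j])
one_hot = [1.0 if i == top else 0.0 for i in range(len(teacher_probs))]
print(one_hot)  # → [1.0, 0.0, 0.0]

# A student that reproduces the teacher's ranking incurs a small loss
# under the soft target, because matching the runner-up probability is
# rewarded. Under the one-hot target, that same probability mass on
# token 1 is pure error.
student_probs = softmax([3.8, 3.6, 0.2])
print(round(kl(teacher_probs, student_probs), 4))
```

The gap between the two targets is exactly the "dark knowledge" the question is probing for: the teacher's ranking of, and relative confidence in, the non-argmax tokens.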


Updated 2025-10-04


Tags

Ch.3 Prompting - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Evaluation in Bloom's Taxonomy
