Short Answer

Critique of a Modified Training Objective

A common method for training a small 'student' model involves finding its optimal parameters, $\hat{\theta}$, by minimizing a loss function that compares its output distribution, $\text{Pr}_{\theta}^s$, to a larger 'teacher' model's distribution, $\text{Pr}^t$, over a dataset $\mathcal{D}'$. The objective is:

$$\hat{\theta} = \arg\min_{\theta} \sum_{\mathbf{x}' \in \mathcal{D}'} \text{Loss}(\text{Pr}^t(\cdot|\cdot), \text{Pr}_{\theta}^s(\cdot|\cdot), \mathbf{x}')$$

A researcher proposes simplifying this process by replacing the teacher's full probability distribution, $\text{Pr}^t(\cdot|\cdot)$, with only its single most probable output (i.e., using a one-hot encoded vector corresponding to the teacher's top prediction). Critically evaluate this modification. What specific type of information from the teacher would the student model fail to learn, and why is this information valuable?
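To make the contrast concrete, the sketch below compares the two targets on a toy 3-token vocabulary (the logit values are hypothetical, chosen only for illustration). The teacher's full distribution encodes a near-tie between its top two tokens; the one-hot replacement erases that relative-probability information entirely.

```python
import math

def softmax(logits, temperature=1.0):
    # Convert raw scores into a probability distribution.
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    # KL(p || q): a common choice of Loss(Pr^t, Pr^s, x) in distillation.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical teacher logits over a 3-token vocabulary for one context.
teacher_probs = softmax([4.0, 3.5, 0.1])
print([round(p, 3) for p in teacher_probs])  # → [0.615, 0.373, 0.012]

# One-hot target built from the teacher's argmax, as the researcher
# proposes. The near-tie between tokens 0 and 1 is erased: token 1 now
# looks exactly as wrong as token 2.
top = max(range(len(teacher_probs)), key=lambda j: teacher_probs[j])
one_hot = [1.0 if i == top else 0.0 for i in range(len(teacher_probs))]
print(one_hot)  # → [1.0, 0.0, 0.0]

# A student that reproduces the teacher's ranking incurs a small loss
# under the soft target, because matching the runner-up probability is
# rewarded. Under the one-hot target, that same probability mass on
# token 1 is pure error.
student_probs = softmax([3.8, 3.6, 0.2])
print(round(kl(teacher_probs, student_probs), 4))
```

The gap between the two targets is exactly the "dark knowledge" the question is probing for: the teacher's ranking of, and relative confidence in, the non-argmax tokens.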


Updated 2025-10-04


Tags

Ch.3 Prompting - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Evaluation in Bloom's Taxonomy
