Formula

Objective Function for Student Model Training via Knowledge Distillation

The optimal parameters $\hat{\theta}$ for a student model are found by minimizing a loss function over a dataset $\mathcal{D}'$ of simplified inputs. This process is defined by the formula:

$$\hat{\theta} = \arg\min_{\theta} \sum_{\mathbf{x}' \in \mathcal{D}'} \text{Loss}(\text{Pr}^t(\cdot|\cdot), \text{Pr}_{\theta}^s(\cdot|\cdot), \mathbf{x}')$$

The loss function measures the discrepancy between the output probability distribution of the teacher model, $\text{Pr}^t(\cdot|\cdot)$, and that of the student model, $\text{Pr}_{\theta}^s(\cdot|\cdot)$, for each input $\mathbf{x}'$.
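The objective can be illustrated with a minimal sketch in plain Python. Several things here are assumptions made only for illustration and are not specified by the formula: the loss is taken to be the KL divergence from the teacher distribution to the student distribution, the student is a toy model whose parameters θ are simply one row of logits per input, and the names `softmax`, `kl_divergence`, `distillation_objective`, and `fit_student` are hypothetical. The teacher probabilities are made-up toy values.

```python
import math

def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists.
    Assumption: KL divergence stands in for the unspecified Loss."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def distillation_objective(teacher_probs, theta):
    """Sum over x' in D' of Loss(Pr^t, Pr^s_theta, x').
    teacher_probs[i] is the teacher's output distribution for the i-th input;
    theta[i] holds the toy student's logits for the same input."""
    return sum(kl_divergence(t, softmax(row)) for t, row in zip(teacher_probs, theta))

def fit_student(teacher_probs, steps=500, lr=0.5):
    """Approximate the arg min over theta with plain gradient descent.
    For KL(t || softmax(z)), the gradient with respect to z is softmax(z) - t."""
    theta = [[0.0] * len(t) for t in teacher_probs]  # start from uniform outputs
    for _ in range(steps):
        for row, t in zip(theta, teacher_probs):
            q = softmax(row)
            for k in range(len(row)):
                row[k] -= lr * (q[k] - t[k])
    return theta

# Toy teacher outputs for two simplified inputs (illustrative values only).
teacher = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
theta_hat = fit_student(teacher)
print("objective at theta_hat:", distillation_objective(teacher, theta_hat))
print("student distributions:", [softmax(row) for row in theta_hat])
```

In this toy setting the student can match the teacher exactly, so the objective approaches zero. A real student shares parameters across inputs and would be trained with an automatic-differentiation framework, and other losses (for example, cross-entropy against the teacher's outputs) are equally compatible with the formula.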

Updated 2026-05-02

Tags

Ch.3 Prompting - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
