
Context Distillation Loss Function

In the context distillation method, knowledge distillation is performed by minimizing a loss function defined over the outputs of the teacher and student models:

$$\hat{\theta} = \operatorname*{arg\,min}_{\theta} \sum_{\mathbf{x}' \in \mathcal{D}'} \mathrm{Loss}\big(\mathrm{Pr}^{t}(\cdot \mid \cdot),\ \mathrm{Pr}_{\theta}^{s}(\cdot \mid \cdot),\ \mathbf{x}'\big)$$

where $\mathrm{Pr}^{t}(\cdot \mid \cdot)$ denotes the pre-trained teacher model, $\mathrm{Pr}_{\theta}^{s}(\cdot \mid \cdot)$ denotes the student model with parameters $\theta$, and $\mathcal{D}'$ is the set of inputs $\mathbf{x}'$ used for distillation.
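To make the objective concrete, here is a minimal PyTorch sketch of this training loop. The toy linear "models", the vocabulary size, the synthetic context vector, and the dataset `D_prime` are hypothetical stand-ins, and the KL divergence is just one common choice for $\mathrm{Loss}(\cdot)$; in a real context distillation setup the teacher and student would be full language models, with the teacher conditioned on the context and the student seeing the input alone.

```python
import torch
import torch.nn.functional as F

vocab_size, hidden = 100, 32

# Pr^t(.|.): the frozen, pre-trained teacher (a toy linear head here)
teacher = torch.nn.Linear(hidden, vocab_size)
for p in teacher.parameters():
    p.requires_grad_(False)

# Pr_theta^s(.|.): the student with trainable parameters theta
student = torch.nn.Linear(hidden, vocab_size)

# Hypothetical encoded context (e.g., an instruction prompt) and
# D': hypothetical encoded inputs x' used for distillation
context = torch.randn(hidden)
D_prime = torch.randn(64, hidden)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(200):
    optimizer.zero_grad()
    with torch.no_grad():
        # The teacher is conditioned on the context ...
        teacher_logits = teacher(D_prime + context)
    # ... while the student sees x' alone and must match the teacher's behavior
    student_logits = student(D_prime)
    # Loss(Pr^t, Pr_theta^s, x'): KL divergence between the two output
    # distributions, averaged over the batch drawn from D'
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    loss.backward()
    optimizer.step()
```

After training, the student approximates the teacher's context-conditioned output distribution without being given the context at inference time, which is the point of the objective above.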
