Formula

Target-Generated Output Loss for Context Distillation

To overcome the computational infeasibility of the sequence-level loss, a variant of context distillation trains the student model on outputs generated by the teacher model. For each sample, the teacher produces an output $\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \log \mathrm{Pr}^{t}(\mathbf{y}\,|\,\mathbf{c},\mathbf{z})$, which is then used as the target for learning. The simplified loss function becomes:

$$\mathrm{Loss} = -\log \mathrm{Pr}_{\theta}^{s}(\hat{\mathbf{y}}\,|\,\mathbf{c}', \mathbf{z})$$
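As a concrete illustration, here is a minimal PyTorch sketch of this training signal, not a definitive implementation. It assumes `teacher` and `student` are causal language models that map a `(1, seq_len)` tensor of token ids to `(1, seq_len, vocab)` next-token logits; greedy decoding stands in for the intractable sequence-level $\arg\max$; and `c_ids`, `c_prime_ids`, `z_ids`, and `eos_id` are hypothetical token-id tensors for the teacher context $\mathbf{c}$, the student context $\mathbf{c}'$, and the input $\mathbf{z}$.

```python
import torch
import torch.nn.functional as F

# Sketch under assumptions: `teacher` and `student` are causal LMs taking a
# (1, seq_len) tensor of token ids and returning (1, seq_len, vocab) logits.
# All names below are illustrative, not taken from the source.

@torch.no_grad()
def teacher_target(teacher, c_ids, z_ids, max_new_tokens=64, eos_id=2):
    """Greedy decoding as a tractable stand-in for
    y_hat = argmax_y log Pr^t(y | c, z)."""
    seq = torch.cat([c_ids, z_ids], dim=-1)
    generated = []
    for _ in range(max_new_tokens):
        logits = teacher(seq)                       # (1, len(seq), vocab)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)
        seq = torch.cat([seq, next_id], dim=-1)
        if next_id.item() == eos_id:
            break
    return torch.cat(generated, dim=-1)             # y_hat: (1, T)

def target_generated_loss(student, c_prime_ids, z_ids, y_hat):
    """Loss = -log Pr^s_theta(y_hat | c', z), summed over the tokens of y_hat."""
    prefix = torch.cat([c_prime_ids, z_ids], dim=-1)
    seq = torch.cat([prefix, y_hat], dim=-1)
    logits = student(seq)                           # (1, len(seq), vocab)
    # Logits at position i predict token i+1, so the predictions for the
    # tokens of y_hat start at the last position of the prefix.
    start = prefix.size(-1) - 1
    pred = logits[:, start:start + y_hat.size(-1)]  # (1, T, vocab)
    log_probs = F.log_softmax(pred, dim=-1)
    token_lp = log_probs.gather(-1, y_hat.unsqueeze(-1)).squeeze(-1)
    return -token_lp.sum()                          # negative log-likelihood
```

Each training step calls `teacher_target` once to produce $\hat{\mathbf{y}}$, then backpropagates `target_generated_loss` through the student only; the teacher stays frozen, which is what makes this variant cheap compared with the full sequence-level loss.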

