Formula

Sequence-Level Loss in Context Distillation

A commonly used loss function for context distillation is the sequence-level loss, which calculates the error over an entire sequence. It takes the basic form:

\mathrm{Loss} = -\sum_{\mathbf{y}} \mathrm{Pr}^{t}(\mathbf{y}|\mathbf{c},\mathbf{z}) \log \mathrm{Pr}_{\theta}^{s}(\mathbf{y}|\mathbf{c}',\mathbf{z})

where \mathbf{c} is the original instruction, \mathbf{c}' is the simplified instruction, and \mathbf{z} is the user input. However, this function is computationally infeasible in practice because it requires summing over an exponentially large number of possible output sequences \mathbf{y}.
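The blow-up is easy to see numerically. The toy sketch below (all names and the factorized per-token distributions are illustrative assumptions, not the book's setup; real LMs condition each token on the prefix) enumerates every output sequence to compute the exact loss, then estimates the same quantity by sampling sequences from the teacher instead of enumerating them:

```python
import itertools, math, random

# Toy setup: a 4-token vocabulary and length-6 outputs. Even this tiny
# case already has 4**6 = 4096 sequences; a real model has |V|**L.
random.seed(0)
VOCAB = ["a", "b", "c", "d"]
LENGTH = 6

def random_dist(n):
    """A random categorical distribution over n items."""
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

# Illustrative assumption: teacher and student factorize per position.
teacher = [random_dist(len(VOCAB)) for _ in range(LENGTH)]
student = [random_dist(len(VOCAB)) for _ in range(LENGTH)]

def seq_prob(dists, seq):
    """Probability of a whole sequence under a factorized model."""
    p = 1.0
    for t, tok in enumerate(seq):
        p *= dists[t][VOCAB.index(tok)]
    return p

# Exact loss: sum over every possible output sequence y.
exact, count = 0.0, 0
for y in itertools.product(VOCAB, repeat=LENGTH):
    exact -= seq_prob(teacher, y) * math.log(seq_prob(student, y))
    count += 1

# Monte Carlo estimate: sample y from the teacher, average -log Pr^s(y).
def sample(dists):
    return [random.choices(VOCAB, weights=d)[0] for d in dists]

samples = [sample(teacher) for _ in range(5000)]
approx = -sum(math.log(seq_prob(student, y)) for y in samples) / len(samples)

print(count)                              # number of terms in the exact sum
print(round(exact, 3), round(approx, 3))  # the two values should be close
```

Sampling from the teacher works because the loss is an expectation under \mathrm{Pr}^{t}: a handful of teacher samples gives an unbiased estimate, while the exact sum grows as |V|^L and is hopeless for realistic vocabularies and lengths.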

Updated 2026-04-30
