Formula

Target-Generated Output Loss for Context Distillation

To overcome the computational infeasibility of the sequence-level loss, a variant of context distillation trains the student model on outputs generated by the teacher model. For each sample, the teacher produces an output $\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \log \mathrm{Pr}^{t}(\mathbf{y}\,|\,\mathbf{c},\mathbf{z})$, which is then used as the target for learning. The simplified loss function becomes:

$$\mathrm{Loss} = -\log \mathrm{Pr}_{\theta}^{s}(\hat{\mathbf{y}}\,|\,\mathbf{c}', \mathbf{z})$$
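As a concrete illustration, here is a minimal PyTorch sketch of this training signal, not a definitive implementation. It assumes `teacher` and `student` are causal language models that map a `(1, seq_len)` tensor of token ids to `(1, seq_len, vocab)` next-token logits; greedy decoding stands in for the intractable sequence-level $\arg\max$; and `c_ids`, `c_prime_ids`, `z_ids`, and `eos_id` are hypothetical token-id tensors for the teacher context $\mathbf{c}$, the student context $\mathbf{c}'$, and the input $\mathbf{z}$.

```python
import torch
import torch.nn.functional as F

# Sketch under assumptions: `teacher` and `student` are causal LMs taking a
# (1, seq_len) tensor of token ids and returning (1, seq_len, vocab) logits.
# All names below are illustrative, not taken from the source.

@torch.no_grad()
def teacher_target(teacher, c_ids, z_ids, max_new_tokens=64, eos_id=2):
    """Greedy decoding as a tractable stand-in for
    y_hat = argmax_y log Pr^t(y | c, z)."""
    seq = torch.cat([c_ids, z_ids], dim=-1)
    generated = []
    for _ in range(max_new_tokens):
        logits = teacher(seq)                       # (1, len(seq), vocab)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)
        seq = torch.cat([seq, next_id], dim=-1)
        if next_id.item() == eos_id:
            break
    return torch.cat(generated, dim=-1)             # y_hat: (1, T)

def target_generated_loss(student, c_prime_ids, z_ids, y_hat):
    """Loss = -log Pr^s_theta(y_hat | c', z), summed over the tokens of y_hat."""
    prefix = torch.cat([c_prime_ids, z_ids], dim=-1)
    seq = torch.cat([prefix, y_hat], dim=-1)
    logits = student(seq)                           # (1, len(seq), vocab)
    # Logits at position i predict token i+1, so the predictions for the
    # tokens of y_hat start at the last position of the prefix.
    start = prefix.size(-1) - 1
    pred = logits[:, start:start + y_hat.size(-1)]  # (1, T, vocab)
    log_probs = F.log_softmax(pred, dim=-1)
    token_lp = log_probs.gather(-1, y_hat.unsqueeze(-1)).squeeze(-1)
    return -token_lp.sum()                          # negative log-likelihood
```

Each training step calls `teacher_target` once to produce $\hat{\mathbf{y}}$, then backpropagates `target_generated_loss` through the student only; the teacher stays frozen, which is what makes this variant cheap compared with the full sequence-level loss.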

