
Context Distillation Loss Function

In the context distillation method, knowledge distillation is performed by minimizing a loss function defined over the outputs of the teacher and student models:

$$\hat{\theta} = \operatorname*{arg\,min}_{\theta} \sum_{\mathbf{x}' \in \mathcal{D}'} \mathrm{Loss}\big(\mathrm{Pr}^{t}(\cdot \mid \cdot),\ \mathrm{Pr}_{\theta}^{s}(\cdot \mid \cdot),\ \mathbf{x}'\big)$$

where $\mathrm{Pr}^{t}(\cdot \mid \cdot)$ denotes the pre-trained teacher model, $\mathrm{Pr}_{\theta}^{s}(\cdot \mid \cdot)$ denotes the student model with parameters $\theta$, and $\mathcal{D}'$ is the set of inputs $\mathbf{x}'$ used for distillation.
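To make the objective concrete, here is a minimal PyTorch sketch of this training loop. The toy linear "models", the vocabulary size, the synthetic context vector, and the dataset `D_prime` are hypothetical stand-ins, and the KL divergence is just one common choice for $\mathrm{Loss}(\cdot)$; in a real context distillation setup the teacher and student would be full language models, with the teacher conditioned on the context and the student seeing the input alone.

```python
import torch
import torch.nn.functional as F

vocab_size, hidden = 100, 32

# Pr^t(.|.): the frozen, pre-trained teacher (a toy linear head here)
teacher = torch.nn.Linear(hidden, vocab_size)
for p in teacher.parameters():
    p.requires_grad_(False)

# Pr_theta^s(.|.): the student with trainable parameters theta
student = torch.nn.Linear(hidden, vocab_size)

# Hypothetical encoded context (e.g., an instruction prompt) and
# D': hypothetical encoded inputs x' used for distillation
context = torch.randn(hidden)
D_prime = torch.randn(64, hidden)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(200):
    optimizer.zero_grad()
    with torch.no_grad():
        # The teacher is conditioned on the context ...
        teacher_logits = teacher(D_prime + context)
    # ... while the student sees x' alone and must match the teacher's behavior
    student_logits = student(D_prime)
    # Loss(Pr^t, Pr_theta^s, x'): KL divergence between the two output
    # distributions, averaged over the batch drawn from D'
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    loss.backward()
    optimizer.step()
```

After training, the student approximates the teacher's context-conditioned output distribution without being given the context at inference time, which is the point of the objective above.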
