The benchmark model in knowledge distillation employs a joint loss function that combines the distillation loss and the student loss. The student loss is typically the cross-entropy between the ground-truth label and the student model's softmax output at temperature $$T = 1$$, expressed as $$L_{CE}(y, p(z_s, T = 1))$$.
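As a rough illustration of how the two terms might be combined, here is a minimal NumPy sketch. The weighting factor `alpha`, the default temperature `T=4.0`, the use of a soft-target cross-entropy for the distillation term, and all function names are assumptions made for this example; they are not specified by the text above.

```python
import numpy as np

def log_softmax(z, T=1.0):
    # Log of the temperature-scaled softmax p(z, T), computed stably.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def student_loss(y, z_s):
    # L_CE(y, p(z_s, T=1)): cross-entropy against integer class labels y.
    y = np.asarray(y)
    log_p = log_softmax(z_s, T=1.0)
    return float(-np.mean(log_p[np.arange(len(y)), y]))

def distillation_loss(z_t, z_s, T):
    # Assumed form: cross-entropy between softened teacher and student outputs.
    p_t = np.exp(log_softmax(z_t, T))
    return float(-np.mean(np.sum(p_t * log_softmax(z_s, T), axis=-1)))

def joint_loss(y, z_t, z_s, T=4.0, alpha=0.5):
    # Assumed weighting: alpha on the distillation term, (1 - alpha) on the student term.
    return alpha * distillation_loss(z_t, z_s, T) + (1.0 - alpha) * student_loss(y, z_s)

# Hypothetical logits for a batch of two examples and two classes.
y = np.array([0, 1])
z_teacher = np.array([[3.0, 0.0], [0.0, 3.0]])
z_student = np.array([[2.0, 0.0], [0.0, 2.0]])
print(joint_loss(y, z_teacher, z_student))
```

Setting `alpha=0` reduces the joint loss to the plain student loss, which is a quick way to sanity-check the combination.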


Response-based knowledge is simple, effective, and widely applicable. It refers to the neural response of the teacher model’s last output layer, and the student model is trained to directly mimic the teacher model’s final prediction.


Response-based knowledge


- Object detection: the teacher response can be the logits together with a bounding-box offset
- Semantic landmark localization: the teacher response can be a heat map for each landmark
- Image classification: the teacher response is usually the soft targets, the probabilities that the input belongs to each class. Soft targets are the most popular response-based knowledge for image classification.
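For reference, soft targets are commonly computed with a temperature-scaled softmax over the logits. The formula below uses the $$p(z, T)$$ notation from the student loss; the index symbols $$i$$ and $$j$$ and the sample logits are chosen only for illustration.

$$
p(z_i, T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
$$

For example, with logits $$z = (2, 1, 0)$$, a temperature of $$T = 1$$ gives approximately $$(0.67, 0.24, 0.09)$$, while $$T = 2$$ gives approximately $$(0.51, 0.31, 0.19)$$. A higher temperature produces a softer distribution over the classes.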

Example Use Cases 


The distillation loss for response-based knowledge transfer is calculated as the divergence between the logit vectors from the teacher and student models. This can be formally expressed as $$L_{ResD}(z_t, z_s) = L_R(z_t, z_s)$$, where $$z_t$$ and $$z_s$$ are the logits from the teacher and student models, respectively, and $$L_R(\cdot)$$ represents the divergence loss function.
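The definition above leaves $$L_R(\cdot)$$ open. One common choice is the Kullback-Leibler divergence between the temperature-softened outputs of the teacher and the student. The sketch below assumes that choice; the function names and the default temperature `T=4.0` are hypothetical and not taken from the text.

```python
import numpy as np

def log_softmax(z, T=1.0):
    # Log of the temperature-scaled softmax, computed stably.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def response_distillation_loss(z_t, z_s, T=4.0):
    # Assumed L_R: KL(p(z_t, T) || p(z_s, T)), averaged over the batch.
    log_p_t = log_softmax(z_t, T)
    log_p_s = log_softmax(z_s, T)
    kl = np.sum(np.exp(log_p_t) * (log_p_t - log_p_s), axis=-1)
    return float(np.mean(kl))

# Hypothetical teacher and student logits for one example with three classes.
z_t = np.array([[2.0, 1.0, 0.0]])
z_s = np.array([[1.5, 1.0, 0.5]])
print(response_distillation_loss(z_t, z_s))
```

The loss is zero when the student's logits match the teacher's and positive otherwise, so minimizing it pulls the student's output distribution toward the teacher's.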

Distillation Loss for Response-Based Knowledge

Benchmark Model in Knowledge Distillation

Response-based knowledge distillation relies exclusively on the last layer’s output, meaning it does not encapsulate intermediate-level supervision from the teacher model. Furthermore, this approach is generally limited to supervised learning contexts because the soft logits it depends upon represent class probability distributions.
