The distillation loss for feature-based knowledge transfer matches the feature maps of intermediate layers between teacher and student models. It is calculated as:

$$L_{FeaD}(f_t(x), f_s(x)) = L_F(\phi_t(f_t(x)), \phi_s(f_s(x)))$$

Where:
- $$f_t(x)$$ and $$f_s(x)$$ are the feature maps of the intermediate layers of the teacher and student models, respectively.
- $$\phi_t(\cdot)$$ and $$\phi_s(\cdot)$$ are transformation functions applied when the feature maps of the two models have different shapes.
- $$L_F(\cdot, \cdot)$$ is the similarity function that measures how closely the two (transformed) feature maps match.
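
As a minimal sketch of the formula above, assume $$\phi_t$$ is the identity, $$\phi_s$$ is a linear projection that maps student features to the teacher's width, and $$L_F$$ is mean squared error. These choices, the toy data, and every function name below are illustrative assumptions, not part of the definition.

```python
def project(features, weight):
    """phi_s (assumed linear): map each student vector to the teacher's width.

    features: list of vectors of length d_s
    weight:   d_s x d_t matrix given as nested lists
    """
    d_t = len(weight[0])
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(d_t)]
        for v in features
    ]


def mse(a, b):
    """L_F (assumed MSE): mean squared error over two same-shaped feature maps."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def feature_distillation_loss(teacher_feats, student_feats, weight):
    """L_FeaD = L_F(phi_t(f_t(x)), phi_s(f_s(x))), with phi_t taken as the identity."""
    return mse(teacher_feats, project(student_feats, weight))


# Hypothetical toy feature maps: 2 positions, teacher width 3, student width 2.
teacher_feats = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
student_feats = [[1.0, 2.0], [0.0, 1.0]]
weight = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # 2 x 3 projection matrix

loss = feature_distillation_loss(teacher_feats, student_feats, weight)
print(loss)
```

Here the projection is what makes the comparison possible: without it, the student's width-2 features could not be matched entry by entry against the teacher's width-3 features.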

Feature-based knowledge is the form of knowledge used in knowledge distillation when the feature maps produced by a teacher model's intermediate layers, not just its final output, are transferred to a student model. It builds on representation learning: deep networks learn multiple levels of increasingly abstract feature representations, so the intermediate layers can also supervise the student. It is especially suitable for training thinner and deeper student networks.

Feature-based knowledge

Distillation Loss for Feature-Based Knowledge

Deep neural networks excel at representation learning: they develop multiple levels of increasingly abstract feature representations. Because of this, both the output of the last layer and the outputs of the intermediate layers (feature maps) can be extracted and used as knowledge.

Feature Maps as Knowledge

Matching intermediate representations between a teacher and a student model provides hints for training the student. This technique aligns the feature activations of the student with those of the teacher to facilitate feature-based knowledge transfer.
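
The alignment described above can be illustrated with a toy training loop. This is a sketch under strong simplifying assumptions: the teacher and the student are each reduced to a single linear layer of the same width (so no transformation function is needed), the matching loss is the mean squared L2 distance, and the student is trained with plain gradient descent. All names and values are hypothetical.

```python
# Fixed teacher layer and trainable student layer, both linear maps R^2 -> R^2.
teacher_w = [[2.0, 0.0], [0.0, -1.0]]
student_w = [[0.0, 0.0], [0.0, 0.0]]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def layer(w, x):
    """Intermediate activation (feature) of a linear layer: w @ x."""
    return [sum(w[j][i] * x[i] for i in range(len(x))) for j in range(len(w))]


def hint_loss(w_s):
    """Mean squared L2 distance between teacher and student features."""
    total = 0.0
    for x in inputs:
        t, s = layer(teacher_w, x), layer(w_s, x)
        total += sum((a - b) ** 2 for a, b in zip(t, s))
    return total / len(inputs)


def hint_step(w_s, lr=0.1):
    """One gradient-descent step that pulls student features toward the teacher's."""
    grad = [[0.0] * len(w_s[0]) for _ in w_s]
    for x in inputs:
        t, s = layer(teacher_w, x), layer(w_s, x)
        for j in range(len(w_s)):
            for i in range(len(x)):
                # d/dw[j][i] of (s_j - t_j)^2, averaged over the inputs
                grad[j][i] += 2.0 * (s[j] - t[j]) * x[i] / len(inputs)
    return [
        [w_s[j][i] - lr * grad[j][i] for i in range(len(w_s[0]))]
        for j in range(len(w_s))
    ]


loss_before = hint_loss(student_w)
for _ in range(200):
    student_w = hint_step(student_w)
loss_after = hint_loss(student_w)
print(loss_before, loss_after)
```

In this toy setting the student's weights approach the teacher's, so the matching loss shrinks toward zero. In practice the student is usually smaller than the teacher, and the feature-matching term is typically combined with the ordinary task loss.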
