Learn Before
Concept

Distillation Loss

The distillation loss for feature-based knowledge transfer is

$$L_{FeaD}(f_t(x), f_s(x)) = L_F(\phi_t(f_t(x)), \phi_s(f_s(x)))$$

- $f_t(x)$ and $f_s(x)$ are the feature maps of the intermediate layers of the teacher and student models
- $\phi_t(f_t(x))$ and $\phi_s(f_s(x))$ are transformation functions, applied when the feature maps of the two models have different shapes
- $L_F(\cdot)$ is the similarity function used to match the models' feature maps
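The formula above can be sketched in a few lines of NumPy. Everything here is illustrative: the feature shapes, the random linear projections standing in for $\phi_t$ and $\phi_s$, and the choice of mean-squared error as $L_F$ are assumptions, not part of the definition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical intermediate feature maps: the teacher layer is wider
# than the student layer, so their shapes differ.
f_t = rng.standard_normal((4, 64))  # teacher features, batch of 4
f_s = rng.standard_normal((4, 32))  # student features, batch of 4

# phi_t / phi_s: transformations onto a shared shape. Here a random
# linear projection maps the teacher down to the student's width;
# the student needs no change.
W_t = rng.standard_normal((64, 32)) / np.sqrt(64)

def phi_t(f):
    return f @ W_t  # project teacher features to 32 dims

def phi_s(f):
    return f        # student features already have the target shape

def l_f(a, b):
    """Similarity function L_F: mean-squared error between feature maps."""
    return float(np.mean((a - b) ** 2))

def feature_distillation_loss(f_t, f_s):
    """L_FeaD(f_t(x), f_s(x)) = L_F(phi_t(f_t(x)), phi_s(f_s(x)))."""
    return l_f(phi_t(f_t), phi_s(f_s))

loss = feature_distillation_loss(f_t, f_s)
```

In training, this scalar would be added to the student's task loss and minimized by gradient descent, pulling the student's intermediate representations toward the teacher's.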


Updated 2022-10-22

Tags

Deep Learning (in Machine learning)

Data Science
