Multi-level Knowledge Distillation in BERT

Knowledge distillation can be applied to BERT at multiple levels of its architecture. Beyond matching the teacher model's final output predictions, knowledge can also be distilled from the intermediate hidden layers: an additional training loss encourages the student model's hidden-layer outputs to mimic those of the teacher.
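A minimal PyTorch sketch of such a combined loss follows, assuming the teacher and student expose their logits and per-layer hidden states. The function name, the KL-divergence prediction loss, the MSE hidden-state loss with learned linear projections (a TinyBERT-style choice for when the student's hidden size differs from the teacher's), and the `temperature` and `alpha` hyperparameters are illustrative assumptions, not details from this text.

```python
import torch
import torch.nn.functional as F

def multilevel_distillation_loss(
    student_logits,   # [batch, num_classes]
    teacher_logits,   # [batch, num_classes]
    student_hiddens,  # list of [batch, seq_len, d_student] tensors
    teacher_hiddens,  # list of [batch, seq_len, d_teacher] tensors, same length
    projections,      # nn.ModuleList of nn.Linear(d_student, d_teacher), one per matched layer
    temperature=2.0,  # assumed softening temperature for the prediction loss
    alpha=0.5,        # assumed weight balancing the two loss terms
):
    # Prediction-level distillation: KL divergence between the softened
    # teacher and student output distributions (scaled by T^2 so gradient
    # magnitudes are comparable across temperatures).
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    pred_loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2

    # Hidden-layer distillation: MSE between each matched pair of hidden
    # states, projecting the student's states into the teacher's dimension.
    hidden_loss = 0.0
    for proj, s_h, t_h in zip(projections, student_hiddens, teacher_hiddens):
        hidden_loss = hidden_loss + F.mse_loss(proj(s_h), t_h)
    hidden_loss = hidden_loss / len(student_hiddens)

    return alpha * pred_loss + (1 - alpha) * hidden_loss
```

When the student is shallower than the teacher, only a subset of teacher layers is typically matched (for example, every k-th layer), so the two hidden-state lists above would hold just those selected pairs.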
