Recent methods include teacher-student learning, assistant teaching, lifelong learning, and self-learning. 


Extension of Methods 


Model transfer of knowledge can be extended to other tasks, e.g. adversarial attacks, data augmentation, data privacy and security. The idea of knowledge transfer can be applied to dataset distillation, which transfers the knowledge from a large dataset to a smaller one. 


Extension of Concept 


Extensions of knowledge distillation focus on compressing deep neural networks. 


University of California, Berkeley

Knowledge distillation is a type of model compression and acceleration technique. It counteracts the challenge of deploying efficient deep learning models on devices with limited resources (e.g. mobile devices and embedded systems) due to computational complexity and storage requirements. 

Knowledge Distillation

Gou, J., Yu, B., Maybank, S.J., & Tao, D. (2021). Knowledge Distillation: A Survey. Int. J. Comput. Vis., 129, 1789-1819.

Link: https://arxiv.org/pdf/2006.05525.pdf

Knowledge Distillation: A Survey

- Knowledge
- Distillation algorithm
- Teacher-student architecture


Components of a Knowledge Distillation System


Extensions

Student models can be applied to visual and speech recognition and NLP. Knowledge distillation has been used for label smoothing, accessing the teacher’s accuracy, and for obtaining a prior for the optimal output layer geometry. 


Applications 

The teacher learns relationships between activations, neurons, or sample pairs. The logits or parameters of a large deep model are used as the teacher knowledge. Activations, neurons, or features of intermediate layers can also guide learning. 



KD Workflow 

The principles of knowledge distillation can be broadly applied to transfer prompting knowledge into a student model's parameters. By training the student model to replicate the outputs of a teacher model, the knowledge embedded in the prompt is effectively distilled. Consequently, the student model can be viewed as having encoded this distilled knowledge in the form of a soft prompt.

Distilling Prompting Knowledge into Soft Prompts

A company has developed a highly accurate, but very large and computationally intensive, language model for sentiment analysis. They want to deploy this feature on a mobile app, where processing power, memory, and network latency are significant constraints. Propose a strategy to create a smaller, faster model suitable for the mobile app that leverages the existing large model, without simply training a new small model from scratch on the original dataset. Describe the roles of the original model and the new model in your proposed process.

Efficient Model Deployment for Mobile Applications

A machine learning team is developing a compact model for a mobile application. They have a large, highly accurate 'teacher' model and a smaller 'student' model architecture. Instead of training the student model directly on the original dataset with its ground-truth labels (e.g., 'this image is a cat'), they train it to mimic the full output probability distribution of the teacher model (e.g., '90% cat, 5% dog, 1% tiger...'). Why is this technique often more effective for the student model's performance than training it from scratch on the original labels?

In the context of training a smaller 'student' model from a larger 'teacher' model, describe two distinct types of 'knowledge' that can be transferred from the teacher to the student, beyond just the final predictions. For each type, briefly explain how it helps the student model learn more effectively.

Mechanisms of Knowledge Transfer

Context distillation is a knowledge distillation method designed to adapt large language models (LLMs) to follow simplified instructions. It involves training a student model to make predictions based on user inputs and simplified contexts (such as condensed instructions). This is achieved by transferring knowledge from a well-trained, instruction-following teacher model that processes the original, detailed instructions. The student model learns by minimizing the loss between its predictions and those produced by the teacher model.

Learn Before

Related

Learn After