**MiniLM** is a task-agnostic Transformer compression method introduced by Wang et al. (2020). A compact **student** Transformer is trained to mimic two components of the **teacher's last self-attention layer**: (i) the **scaled dot-product attention distributions** over keys, and (ii) a new **value-relation matrix**, defined as the softmax-normalized scaled dot products between value vectors. Both relations have shape sequence length by sequence length, so the student's hidden size does not need to match the teacher's. Distilling only the last layer removes the need to align student and teacher layers explicitly, and an optional **teacher assistant** bridges very large teacher-student size gaps. The resulting student keeps the depth and width chosen by the practitioner (e.g., 6 layers, hidden size 384) while preserving most of the teacher's downstream accuracy on GLUE and SQuAD. MiniLM is also the model family to which the **MiniLM-L6-H384** checkpoint belongs.
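
The two distillation terms described above can be sketched for a single attention head in plain Python. This is a minimal illustration, not the authors' implementation: the function names (`relation`, `mean_kl`, `minilm_head_loss`) and the toy inputs are hypothetical, the two terms are summed with equal weight as an assumption, and a real training setup would use a tensor library and average over all heads of the last layer.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def relation(a, b):
    """Row-wise softmax of the scaled dot products a @ b.T / sqrt(d).

    With a=queries, b=keys this is the attention distribution;
    with a=b=values it is the value-relation matrix.
    """
    d = len(a[0])
    return [
        softmax([sum(x * y for x, y in zip(ra, rb)) / math.sqrt(d) for rb in b])
        for ra in a
    ]

def mean_kl(teacher, student):
    """KL(teacher_row || student_row), averaged over sequence positions."""
    total = 0.0
    for p_row, q_row in zip(teacher, student):
        total += sum(p * math.log(p / q) for p, q in zip(p_row, q_row) if p > 0)
    return total / len(teacher)

def minilm_head_loss(t_q, t_k, t_v, s_q, s_k, s_v):
    """Attention-distribution loss plus value-relation loss for one head."""
    attn_loss = mean_kl(relation(t_q, t_k), relation(s_q, s_k))
    value_loss = mean_kl(relation(t_v, t_v), relation(s_v, s_v))
    return attn_loss + value_loss

# Toy example: 3 tokens, teacher head size 4, student head size 2.
teacher = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
student = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

loss = minilm_head_loss(teacher, teacher, teacher, student, student, student)
print("distillation loss for one head:", loss)
```

Because every relation matrix has one row and one column per token, the teacher and student head sizes (4 and 2 here) never have to agree; only the sequence length must.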


Model compression was initially proposed as a way to transfer knowledge from a large or ensemble “teacher” model into a small “student” model with similar accuracy. This approach later became known as knowledge distillation.

Model Compression

Root URL: https://arxiv.org/
Source URL: https://arxiv.org/abs/2002.10957
Researcher-agent use: Supports missing prerequisite "MiniLM Deep Self-Attention Distillation (Wang et al., 2020)" for the Researcher agent.

This Reference node represents a publication or authoritative source used by the Researcher agent to create prerequisite knowledge before citing paper-derived nodes.

Learn Before

Related