
MiniLM Deep Self-Attention Distillation (Wang et al., 2020)

MiniLM is a task-agnostic Transformer compression method introduced by Wang et al. (2020). A compact student Transformer is trained to mimic two components of the teacher's last self-attention layer: (i) the scaled dot-product attention distributions over keys, and (ii) the value-relation matrix introduced by the paper, defined as the softmax-normalized scaled dot products between value vectors. Both targets are sequence-length by sequence-length matrices per attention head, so the student's hidden size does not have to match the teacher's, and distilling only the last layer removes the need for an explicit student-teacher layer mapping. When the student is much smaller than the teacher, an optional teacher assistant of intermediate size can be distilled first and then used as the teacher for the final student. The student keeps the depth and width chosen by the practitioner (e.g., 6 layers, hidden size 384) while preserving most of the teacher's downstream accuracy on GLUE and SQuAD. The MiniLM-L6-H384 checkpoint belongs to this model family.
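The two distillation targets described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`relation`, `mean_kl`, `minilm_loss`), the random stand-in tensors, and the head dimensions (64 for the teacher, 32 for the student, with 12 heads each) are assumptions chosen for the example. It assumes the per-head queries, keys, and values of the last layer have already been extracted, and it sums the KL divergence of the attention distributions and of the value relations.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relation(a, b):
    """Softmax-normalized scaled dot products.

    a, b: arrays of shape (heads, seq_len, head_dim).
    Returns an array of shape (heads, seq_len, seq_len).
    """
    d_k = a.shape[-1]
    scores = a @ b.transpose(0, 2, 1) / np.sqrt(d_k)
    return softmax(scores, axis=-1)

def mean_kl(p, q, eps=1e-12):
    """KL(p || q) per row, averaged over heads and query positions."""
    kl = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)
    return kl.mean()

def minilm_loss(q_t, k_t, v_t, q_s, k_s, v_s):
    """Attention-distribution loss plus value-relation loss (hypothetical helper)."""
    # (i) attention distributions: softmax(Q K^T / sqrt(d_k))
    attn_loss = mean_kl(relation(q_t, k_t), relation(q_s, k_s))
    # (ii) value relations: softmax(V V^T / sqrt(d_k))
    value_loss = mean_kl(relation(v_t, v_t), relation(v_s, v_s))
    return attn_loss + value_loss

# Random stand-ins for last-layer projections (assumed shapes, not real model outputs).
rng = np.random.default_rng(0)
heads, seq_len = 12, 8
teacher = [rng.standard_normal((heads, seq_len, 64)) for _ in range(3)]  # q, k, v
student = [rng.standard_normal((heads, seq_len, 32)) for _ in range(3)]  # q, k, v

loss = minilm_loss(*teacher, *student)
self_loss = minilm_loss(*teacher, *teacher)
```

Note that the teacher and student head dimensions differ, yet every matrix being compared has shape `(heads, seq_len, seq_len)`; only the number of heads has to agree in this sketch. Passing the teacher tensors as both arguments gives a loss of zero, which is a quick sanity check.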


Updated 2026-05-16


Tags

Science
