Short Answer

Value of the Teacher's Probability Distribution

In a knowledge distillation setup, a large 'teacher' model is given the user input 'My order is late and I'm frustrated' (context c) and an internal instruction to be 'empathetic' (latent variable z). The teacher then produces a probability distribution, P_t = Pr_t(response | c, z), over possible responses. Explain why this full probability distribution is a more valuable training signal for a smaller 'student' model than the single most likely response alone (e.g., 'I'm sorry to hear that.').
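The contrast the question asks about can be made concrete with a toy sketch. Below, a hypothetical teacher distribution over three candidate responses (the probabilities and responses are illustrative assumptions, not from the source) is compared, via KL divergence, against two students: one that matched the teacher's full soft distribution, and one that only learned the hard argmax label. The soft-trained student preserves the teacher's relative ranking of runner-up responses; the hard-trained one discards it.

```python
import math

# Hypothetical toy candidate responses (illustrative only).
responses = ["I'm sorry to hear that.", "Let me check your order.", "Please hold."]

# Teacher's full distribution P_t(response | c, z): soft probabilities.
teacher_probs = [0.6, 0.3, 0.1]

# Hard label: only the most likely response, as a one-hot target.
hard_label = [1.0, 0.0, 0.0]

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q): how far q is from p; small eps avoids log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# A student that matched the full teacher distribution also captures the
# relative plausibility of the runner-up responses; a student trained on
# the hard label alone learns nothing about them.
student_soft = [0.55, 0.33, 0.12]   # hypothetical result of soft-label training
student_hard = [0.98, 0.01, 0.01]   # hypothetical result of hard-label training

print(kl_divergence(teacher_probs, student_soft))  # small: soft info preserved
print(kl_divergence(teacher_probs, student_hard))  # large: soft info lost
```

The point of the sketch is that the full distribution carries "dark knowledge": it tells the student not just which response the teacher prefers, but how plausible the alternatives are relative to each other, which a single argmax response cannot convey.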


Updated 2025-10-04

Tags

Ch.3 Prompting - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science