Short Answer

Value of the Teacher's Probability Distribution

In a knowledge distillation setup, a large 'teacher' model is given the user input 'My order is late and I'm frustrated' (context c) and an internal instruction to be 'empathetic' (latent variable z). The teacher then produces a probability distribution, P_t = Pr_t(response | c, z), over possible responses. Explain why this full probability distribution is a more valuable training signal for a smaller 'student' model than the single most likely response alone (e.g., 'I'm sorry to hear that.').
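The contrast the question asks about can be made concrete with a toy sketch. Below, a hypothetical teacher distribution over three candidate responses (the probabilities and responses are illustrative assumptions, not from the source) is compared, via KL divergence, against two students: one that matched the teacher's full soft distribution, and one that only learned the hard argmax label. The soft-trained student preserves the teacher's relative ranking of runner-up responses; the hard-trained one discards it.

```python
import math

# Hypothetical toy candidate responses (illustrative only).
responses = ["I'm sorry to hear that.", "Let me check your order.", "Please hold."]

# Teacher's full distribution P_t(response | c, z): soft probabilities.
teacher_probs = [0.6, 0.3, 0.1]

# Hard label: only the most likely response, as a one-hot target.
hard_label = [1.0, 0.0, 0.0]

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q): how far q is from p; small eps avoids log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# A student that matched the full teacher distribution also captures the
# relative plausibility of the runner-up responses; a student trained on
# the hard label alone learns nothing about them.
student_soft = [0.55, 0.33, 0.12]   # hypothetical result of soft-label training
student_hard = [0.98, 0.01, 0.01]   # hypothetical result of hard-label training

print(kl_divergence(teacher_probs, student_soft))  # small: soft info preserved
print(kl_divergence(teacher_probs, student_hard))  # large: soft info lost
```

The point of the sketch is that the full distribution carries "dark knowledge": it tells the student not just which response the teacher prefers, but how plausible the alternatives are relative to each other, which a single argmax response cannot convey.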


Updated 2025-10-04

Tags

Ch.3 Prompting - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science