Teacher-Student Model Architecture in Knowledge Distillation

In a knowledge distillation framework, a larger and more powerful 'teacher' model is used to train a smaller, more efficient 'student' model. The teacher model processes the full-context user input to produce its output probability, denoted Pr_t(y | c, z). In contrast, the student model processes a simplified context to produce its own output, Pr_s(y | c', z). The training objective transfers knowledge from the stronger teacher to the compact student by minimizing a loss function that measures the difference between these two output distributions.
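As a concrete illustration, the sketch below computes a distillation loss as the divergence between the teacher's output distribution given the full context (c, z) and the student's distribution given the simplified context (c', z). This is a minimal PyTorch sketch under common knowledge-distillation assumptions (temperature-softened softmax, KL-divergence loss); the function and variable names are illustrative and not taken from the source.

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher_logits: torch.Tensor,
                      student_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between teacher and student output distributions.

    teacher_logits: teacher outputs given the full context (c, z)
    student_logits: student outputs given the simplified context (c', z)
    """
    # Soften both distributions with a temperature, a common KD practice.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student); scaling by T^2 keeps gradient magnitudes
    # comparable across temperature settings.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Hypothetical usage: logits over a vocabulary for a small batch of positions.
if __name__ == "__main__":
    torch.manual_seed(0)
    teacher_logits = torch.randn(4, 32000)                       # teacher sees (c, z)
    student_logits = torch.randn(4, 32000, requires_grad=True)   # student sees (c', z)
    loss = distillation_loss(teacher_logits, student_logits)
    loss.backward()  # gradients flow only into the student
    print(loss.item())
```

In practice the teacher's parameters are frozen and only the student is updated, so the loss gradient drives the student's distribution over the simplified context toward the teacher's distribution over the full context.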
