Case Study

Choosing a Loss Function for Model Distillation

You are tasked with training a small, efficient language model (the 'student' model) to mimic the behavior of a much larger, more powerful model (the 'teacher' model). The goal is to make the student's output probability distribution, Q(y|x), as close as possible to the teacher's distribution, P(y|x), for any given input x. You decide to use the Kullback-Leibler (KL) divergence between these two distributions as your loss function. Which of the two possible formulations of this divergence should you choose to minimize, and why? Justify your choice by explaining the practical implications of your selection versus the alternative.
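For reference, the two formulations at issue are the forward KL divergence, D_KL(P ‖ Q) = Σ_y P(y|x) log [ P(y|x) / Q(y|x) ], whose expectation is taken under the teacher P, and the reverse KL divergence, D_KL(Q ‖ P) = Σ_y Q(y|x) log [ Q(y|x) / P(y|x) ], whose expectation is taken under the student Q. The sketch below contrasts the two as distillation losses. It is a minimal illustration assuming PyTorch; all names (forward_kl, reverse_kl, teacher_logits, student_logits) are hypothetical and not taken from the course material.

```python
import torch
import torch.nn.functional as F

def forward_kl(teacher_logits: torch.Tensor, student_logits: torch.Tensor) -> torch.Tensor:
    """D_KL(P || Q), averaged over the batch. The expectation is under the
    teacher P, so the student is penalized wherever the teacher assigns
    probability but the student does not (mass-covering behavior)."""
    p = F.softmax(teacher_logits, dim=-1)          # teacher distribution P(y|x)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)  # student distribution Q(y|x)
    return (p * (log_p - log_q)).sum(dim=-1).mean()

def reverse_kl(teacher_logits: torch.Tensor, student_logits: torch.Tensor) -> torch.Tensor:
    """D_KL(Q || P), averaged over the batch. The expectation is under the
    student Q, so the student is penalized for placing mass where the teacher
    has little (mode-seeking, zero-forcing behavior)."""
    q = F.softmax(student_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    return (q * (log_q - log_p)).sum(dim=-1).mean()

# Toy usage: a batch of 4 inputs over a 10-token vocabulary.
teacher_logits = torch.randn(4, 10)                      # frozen teacher outputs
student_logits = torch.randn(4, 10, requires_grad=True)  # trainable student outputs
loss = forward_kl(teacher_logits, student_logits)        # or reverse_kl(...)
loss.backward()                                          # gradients reach only the student
```

Either direction is a valid loss; the practical difference the question asks about is this mass-covering versus mode-seeking asymmetry.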



Tags: Data Science, Foundations of Large Language Models Course, Computing Sciences, Ch.4 Alignment - Foundations of Large Language Models, Foundations of Large Language Models, Application in Bloom's Taxonomy, Cognitive Psychology, Psychology, Social Science, Empirical Science, Science