Learn Before
Choosing a Loss Function for Model Distillation
You are tasked with training a small, efficient language model (the 'student' model) to mimic the behavior of a much larger, more powerful model (the 'teacher' model). The goal is to make the student's output probability distribution, $q(y \mid x)$, as close as possible to the teacher's distribution, $p(y \mid x)$, for any given input $x$. You decide to use the Kullback-Leibler (KL) divergence between these two distributions as your loss function. Which of the two possible formulations of this measure, the forward KL $D_{\mathrm{KL}}(p \,\|\, q)$ or the reverse KL $D_{\mathrm{KL}}(q \,\|\, p)$, should you choose to minimize, and why? Justify your choice by explaining the practical implications of your selection versus the alternative.
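The asymmetry the question hinges on can be made concrete with a small numerical sketch. The distributions below and the helper name `kl` are illustrative assumptions, not part of the question; they show how the two directions of the KL divergence penalize a student that collapses onto one mode of a bimodal teacher:

```python
import math

# Hedged sketch: forward vs. reverse KL divergence between a teacher
# distribution p and a student distribution q over a tiny vocabulary.
# The distribution values and names are illustrative assumptions.

def kl(p, q):
    """Compute D_KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Bimodal teacher; student that concentrates on a single mode.
p_teacher = [0.49, 0.02, 0.49]
q_student = [0.90, 0.05, 0.05]

forward = kl(p_teacher, q_student)  # penalizes q ~ 0 where p has mass ("mode-covering")
reverse = kl(q_student, p_teacher)  # penalizes q mass where p ~ 0 ("mode-seeking")

print(f"forward KL D(p||q) = {forward:.3f}")  # large: q ignores one of p's modes
print(f"reverse KL D(q||p) = {reverse:.3f}")  # smaller: q stays inside p's support
```

Running this shows the forward KL punishing the mode-dropping student far more harshly than the reverse KL does, which is the practical trade-off the question asks you to weigh.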
Tags
Data Science
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Formula for Soft Prompt Optimization by Minimizing KL Divergence
Derivation of the KL Divergence Objective for Policy Optimization
A machine learning model produces a probability distribution Q over a set of outcomes, aiming to approximate a true data distribution P. During evaluation, you observe that the forward divergence, $D_{\mathrm{KL}}(P \,\|\, Q)$, is low, while the reverse divergence, $D_{\mathrm{KL}}(Q \,\|\, P)$, is high. Based on these results, what is the most likely characteristic of the model's distribution Q?
Calculating Divergence Between Distributions
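The related question about a low divergence in one direction and a high reverse divergence can also be reproduced numerically. A hedged sketch, where the distribution values are illustrative assumptions: when Q spreads its mass broadly over a sharply peaked P, the forward KL stays comparatively low while the reverse KL grows large.

```python
import math

# Hedged sketch: an overly broad model distribution Q approximating a
# sharply peaked true distribution P. Values are illustrative assumptions.

def kl(p, q):
    """Compute D_KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.98, 0.01, 0.01]     # true distribution: sharply peaked
Q = [1 / 3, 1 / 3, 1 / 3]  # model: overly broad / smoothed

print(f"D(P||Q) = {kl(P, Q):.3f}")  # comparatively low: Q covers P's peak
print(f"D(Q||P) = {kl(Q, P):.3f}")  # high: Q puts mass where P is tiny
```

The gap between the two numbers illustrates the diagnosis the question is after: a Q that covers P's support but also wastes probability mass in regions where P is nearly zero.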