Learn Before
In a model training setup, a smaller 'student' model is trained to mimic the output probability distribution of a larger 'teacher' model for a given input. The training objective is to minimize the Kullback-Leibler (KL) divergence between the two distributions; the standard loss function is denoted L. A researcher proposes an alternative loss function, L'. How would minimizing L' instead of L most likely change the student model's behavior?
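For concreteness, a minimal sketch of how a KL-based distillation loss could be computed is shown below. It assumes the standard loss L is the forward KL divergence D_KL(P_teacher || P_student), computed from the two models' output distributions for one input; the distribution values and function names here are hypothetical illustrations, not part of the original question.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical teacher and student outputs for a single input.
p_teacher = [0.7, 0.2, 0.1]
p_student = [0.5, 0.3, 0.2]

# Assumed standard (forward) KL distillation loss: D_KL(teacher || student).
loss = kl_divergence(p_teacher, p_student)
print(f"D_KL(teacher || student) = {loss:.4f}")
```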
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Combined Training Objective for Knowledge Distillation
Evaluating Student Model Performance
In a knowledge distillation process, a 'teacher' model produces a probability distribution of [0.8, 0.1, 0.1] over three classes for a given input. Four different 'student' models are being evaluated on the same input, producing the distributions below. Which student model's output distribution is being most effectively guided by the teacher, as measured by the standard Kullback-Leibler (KL) divergence loss function?
Adjusting the Distillation Loss Coefficient
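A sketch of the comparison behind the "Evaluating Student Model Performance" question above: each candidate student distribution is scored with D_KL(teacher || student) and the lowest divergence wins. The teacher distribution [0.8, 0.1, 0.1] comes from the question; the four student distributions below are hypothetical stand-ins, since the original answer options are not reproduced here.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

teacher = [0.8, 0.1, 0.1]

# Hypothetical candidate student distributions (the real options are not shown above).
students = {
    "A": [0.75, 0.15, 0.10],
    "B": [0.60, 0.20, 0.20],
    "C": [0.34, 0.33, 0.33],
    "D": [0.10, 0.80, 0.10],
}

# Score each student and rank: the smallest KL divergence is the most effectively guided.
scores = {name: kl_divergence(teacher, dist) for name, dist in students.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"student {name}: D_KL(teacher || student) = {score:.4f}")
best = min(scores, key=scores.get)
print(f"Most effectively guided (lowest KL): student {best}")
```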