1Cademy

Multiple Choice

A student model is being trained to replicate the output distribution of a teacher model using the loss function:

$$ \text{Loss} = -\sum_{\mathbf{y}} \text{Pr}^t(\mathbf{y}) \log \text{Pr}_{\theta}^s(\mathbf{y}) $$

Suppose that, for a given input, there are only three possible output sequences: A, B, and C. The teacher model assigns the following probabilities:

- `Pr^t(A) = 0.8`
- `Pr^t(B) = 0.15`
- `Pr^t(C) = 0.05`

Two different student models produce the following distributions:

- **Student 1:** `Pr^s(A) = 0.6`, `Pr^s(B) = 0.3`, `Pr^s(C) = 0.1`
- **Student 2:** `Pr^s(A) = 0.6`, `Pr^s(B) = 0.1`, `Pr^s(C) = 0.3`

Without calculating the exact loss, which student model will achieve a lower loss value, and why?
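
The question is meant to be answered by reasoning alone, but a numerical check can confirm the intuition afterwards. Both students assign the same probability to A, so only the B and C terms differ, and the teacher weights B (0.15) more heavily than C (0.05). The sketch below evaluates the loss as stated in the question; the helper name `cross_entropy` and the choice of the natural logarithm are assumptions made for illustration (a different log base rescales both losses but does not change their order).

```python
import math


def cross_entropy(teacher, student):
    """Return -sum_y Pr^t(y) * log Pr^s(y) over the teacher's outcomes.

    Hypothetical helper; uses the natural log (an assumption, the
    question does not specify a base).
    """
    return -sum(p_t * math.log(student[y]) for y, p_t in teacher.items())


# Probabilities taken directly from the question.
teacher = {"A": 0.8, "B": 0.15, "C": 0.05}
student_1 = {"A": 0.6, "B": 0.3, "C": 0.1}
student_2 = {"A": 0.6, "B": 0.1, "C": 0.3}

loss_1 = cross_entropy(teacher, student_1)
loss_2 = cross_entropy(teacher, student_2)

print(f"Student 1 loss: {loss_1:.4f}")  # 0.7044
print(f"Student 2 loss: {loss_2:.4f}")  # 0.8142
```

Under these assumptions Student 1 has the lower loss. The gap equals `0.1 * ln(3)`, which comes entirely from the B and C terms, since the A term is identical for both students.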


Updated 2025-10-08
