Learn Before
A student model is trained to mimic a teacher model by minimizing the following loss function, which measures the dissimilarity between their output probability distributions for a given input:

Loss = Σ_y Pr^t(y) · log( Pr^t(y) / Pr^s(y) )

In this formula, Pr^t(y) is the teacher's probability for an output sequence y, Pr^s(y) is the student's probability, and the summation is over all possible output sequences y. What is the primary function of the summation (Σ_y) over the entire space of possible outputs?
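As a minimal sketch (assuming the loss is the KL divergence between the teacher's and student's output distributions, and using a toy vocabulary small enough to enumerate every output sequence), the summation can be written out as a literal loop; the same loop is what becomes infeasible when the output space grows as |V|^n:

```python
import itertools
import math

def distillation_loss(teacher, student, outputs):
    """Sum over ALL outputs y of Pr^t(y) * log(Pr^t(y) / Pr^s(y))."""
    return sum(teacher(y) * math.log(teacher(y) / student(y)) for y in outputs)

# Toy setting: a 3-token vocabulary and sequences of length 4,
# so the full output space is enumerable (3^4 = 81 sequences).
vocab = ["a", "b", "c"]
length = 4
outputs = list(itertools.product(vocab, repeat=length))

# Hypothetical uniform distributions, just to exercise the summation.
def teacher(y):
    return 1.0 / len(outputs)

def student(y):
    return 1.0 / len(outputs)

print(len(outputs))                                  # 81 terms in the sum
print(distillation_loss(teacher, student, outputs))  # 0.0 for identical distributions
# A realistic vocabulary (~50,000 tokens) and sequence length (~100) would
# require 50000**100 terms -- summing over the full space is infeasible.
```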
Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Computational Infeasibility of Full Output Summation in Distillation Loss
A student model is trained to mimic a teacher model by minimizing the following loss function, which measures the dissimilarity between their output probability distributions for a given input:

Loss = Σ_y Pr^t(y) · log( Pr^t(y) / Pr^s(y) )

In this formula, Pr^t(y) is the teacher's probability for an output sequence y, Pr^s(y) is the student's probability, and the summation is over all possible output sequences y. What is the primary function of the summation (Σ_y) over the entire space of possible outputs?
Evaluating a Loss Function for a Machine Translation Task
A student model is being trained to replicate the output distribution of a teacher model using the loss function:

Loss = Σ_y Pr^t(y) · log( Pr^t(y) / Pr^s(y) )
Suppose for a given input, there are only three possible output sequences: A, B, and C. The teacher model assigns the following probabilities:
Pr^t(A) = 0.8, Pr^t(B) = 0.15, Pr^t(C) = 0.05
Two different student models produce the following distributions:
- Student 1: Pr^s(A) = 0.6, Pr^s(B) = 0.3, Pr^s(C) = 0.1
- Student 2: Pr^s(A) = 0.6, Pr^s(B) = 0.1, Pr^s(C) = 0.3
Without calculating the exact loss, which student model will achieve a lower loss value, and why?
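The qualitative comparison can be checked numerically (a sketch assuming the KL-divergence form of the loss, with the probabilities given above):

```python
import math

# Teacher and student distributions over the three possible outputs.
teacher = {"A": 0.8, "B": 0.15, "C": 0.05}
student1 = {"A": 0.6, "B": 0.3, "C": 0.1}
student2 = {"A": 0.6, "B": 0.1, "C": 0.3}

def loss(t, s):
    # Sum over all outputs y of Pr^t(y) * log(Pr^t(y) / Pr^s(y)).
    return sum(t[y] * math.log(t[y] / s[y]) for y in t)

l1 = loss(teacher, student1)
l2 = loss(teacher, student2)
print(f"Student 1 loss: {l1:.4f}")  # ~0.0915
print(f"Student 2 loss: {l2:.4f}")  # ~0.2014
# Student 1 achieves the lower loss: both students agree on A, but
# Student 1's ordering of B over C matches the teacher's, while
# Student 2 inverts it.
```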