Multiple Choice

A machine learning team is using a combined objective to train a small 'student' model. The goal is to find the student model's parameters $\theta$ that maximize the following expression:

$$\tilde{\theta} = \arg \max_{\theta} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \log \Pr{}_{\theta}^{s}(\mathbf{y}|\mathbf{x}) - \lambda \cdot \text{Loss}_{\text{kd}}$$

The first term, $\log \Pr{}_{\theta}^{s}(\mathbf{y}|\mathbf{x})$, measures how well the student predicts the ground-truth labels $\mathbf{y}$. The second term, $\text{Loss}_{\text{kd}}$, measures the difference between the student's and a larger 'teacher' model's predictions. The team is working with a dataset whose ground-truth labels are known to be somewhat noisy and contain occasional errors. However, the large teacher model has been shown to provide very reliable and well-generalized predictions. Given this situation, how should the team adjust the hyperparameter $\lambda$ to optimize the student model's performance?
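The trade-off in the objective can be made concrete with a small numerical sketch. Below, $\text{Loss}_{\text{kd}}$ is instantiated as the KL divergence between teacher and student distributions, which is one common choice but an assumption here; the function names and toy data are likewise illustrative only.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def combined_objective(student_logits, teacher_logits, labels, lam):
    """Objective from the question: sum of log-likelihoods of the
    ground-truth labels, minus lambda times a KD loss (assumed here
    to be KL(teacher || student), summed over the batch)."""
    s = softmax(student_logits)
    t = softmax(teacher_logits)
    idx = np.arange(len(labels))
    log_lik = np.log(s[idx, labels]).sum()        # sum_(x,y) log Pr_s(y|x)
    kd_loss = (t * (np.log(t) - np.log(s))).sum() # Loss_kd as KL divergence
    return log_lik - lam * kd_loss

# Toy batch: 2 examples, 3 classes. The second label disagrees with
# both models' predictions, simulating label noise in the dataset.
student = np.array([[2.0, 0.5, 0.1], [0.3, 1.8, 0.2]])
teacher = np.array([[2.5, 0.2, 0.0], [0.1, 2.2, 0.1]])
labels  = np.array([0, 2])

# Increasing lambda penalizes disagreement with the (reliable) teacher
# more heavily, reducing the influence of the noisy ground-truth term.
print(combined_objective(student, teacher, labels, lam=0.0))
print(combined_objective(student, teacher, labels, lam=5.0))
```

Because the KL term is non-negative, raising $\lambda$ shifts the optimum toward matching the teacher and away from fitting the (noisy) labels exactly, which is the intuition the question is probing.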

Updated 2025-10-08

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science
