Analyzing the Knowledge Distillation Hyperparameter
Consider the combined training objective for a student model in knowledge distillation:

$$\hat{\theta} = \arg\max_{\theta}\; \Big[\, \underbrace{\log \Pr(\mathbf{y} \mid \mathbf{x};\, \theta)}_{\text{Term A}} \;+\; \lambda \cdot \underbrace{\big(-\mathrm{KL}\big(p_{\text{teacher}} \,\|\, p_{\theta}\big)\big)}_{\text{Term B}} \,\Big]$$

Explain the potential negative consequence for the student model's performance if the hyperparameter λ is set to a very high value, and justify your explanation by referencing the two main components of the formula.
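To make the role of λ concrete, here is a minimal PyTorch-style sketch of this kind of objective, assuming Term A is the (negated) cross-entropy against the ground-truth labels and Term B is the (negated) KL divergence to the teacher's output distribution; the function name `distillation_loss` and the argument `lam` are illustrative, not from the source.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, lam):
    """Loss to *minimize*; maximizing Term A + lam * Term B is
    equivalent to minimizing this quantity."""
    # Term A (negated): cross-entropy against the ground-truth labels;
    # minimizing it maximizes the ground-truth log-likelihood.
    term_a = F.cross_entropy(student_logits, labels)
    # Term B (negated): KL divergence from the teacher's distribution
    # to the student's; smaller means the student matches the teacher.
    term_b = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    # As lam grows, the teacher-matching term dominates the gradient
    # and the ground-truth term is effectively ignored.
    return term_a + lam * term_b
```

With a small `lam` (say 0.1), the ground-truth signal dominates; with a very large `lam`, the student optimizes almost exclusively for matching the teacher, inheriting the teacher's biases and systematic errors while the ground-truth term contributes almost nothing to the gradient.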
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An engineer is training a small 'student' model by learning from a larger 'teacher' model. The training objective is to find the student parameters (θ) that maximize a combined score, formulated as:

$$\hat{\theta} = \arg\max_{\theta}\; \big[\, \text{Term A} + \lambda \cdot \text{Term B} \,\big]$$

where 'Term A' $= \log \Pr(\mathbf{y} \mid \mathbf{x};\, \theta)$ measures how well the student predicts the correct, ground-truth answers, and 'Term B' $= -\mathrm{KL}\big(p_{\text{teacher}} \,\|\, p_{\theta}\big)$ measures how closely the student's outputs match the teacher's outputs. After training, the engineer notices the student model is replicating systematic errors present in the teacher model, leading to poor performance on a validation set. Which adjustment to the hyperparameter λ is the most appropriate first step to address this issue?
Analyzing the Knowledge Distillation Hyperparameter
A machine learning team is using a combined objective to train a small 'student' model. The goal is to find the student model's parameters (θ) that maximize the following expression:

$$\hat{\theta} = \arg\max_{\theta}\; \big[\, \log \Pr(\mathbf{y} \mid \mathbf{x};\, \theta) \;-\; \lambda \cdot \mathrm{KL}\big(p_{\text{teacher}} \,\|\, p_{\theta}\big) \,\big]$$

The first term, $\log \Pr(\mathbf{y} \mid \mathbf{x};\, \theta)$, measures how well the student predicts the ground-truth labels $\mathbf{y}$. The second term, $\mathrm{KL}\big(p_{\text{teacher}} \,\|\, p_{\theta}\big)$, measures the difference between the student's and a larger 'teacher' model's predictions. The team is working with a dataset where the ground-truth labels are known to be somewhat noisy and contain occasional errors. However, the large teacher model has been shown to provide very reliable and well-generalized predictions. Given this situation, how should the team adjust the hyperparameter λ to optimize the student model's performance?
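Both related scenarios above turn on the same trade-off between the two terms. As a rough, hypothetical illustration (reusing the `distillation_loss` sketch from earlier, with made-up tensor shapes and λ values):

```python
import torch

s_logits = torch.randn(8, 10)        # student logits: batch of 8, 10 classes
t_logits = torch.randn(8, 10)        # teacher logits for the same batch
labels = torch.randint(0, 10, (8,))  # ground-truth class labels

# Teacher's systematic errors leaking into the student: lower lam so
# the ground-truth term (Term A) regains influence during training.
loss = distillation_loss(s_logits, t_logits, labels, lam=0.1)

# Noisy ground-truth labels but a reliable teacher: raise lam so the
# teacher's well-generalized distribution (Term B) outweighs label noise.
loss = distillation_loss(s_logits, t_logits, labels, lam=5.0)
```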