Objective Function for Student Model Training via Knowledge Distillation
The optimal parameters for a student model, θ̂_s, are found by minimizing a loss function over a dataset of simplified inputs D'. This process is defined by the formula: θ̂_s = argmin_θ Σ_{x ∈ D'} Loss(P_t(·|x), P_theta^s(·|x)). The loss function measures the discrepancy between the output probability distribution of the teacher model, P_t, and that of the student model, P_theta^s, for each input x in D'.
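As a concrete illustration, the objective can be sketched in a few lines of NumPy. This is a minimal sketch, not a full training loop: it assumes the per-input Loss is cross-entropy between the teacher's and student's output distributions (one common choice, alongside KL divergence), and the logits, function names, and toy data are illustrative, not from the source.

```python
import numpy as np

def softmax(z):
    """Convert logits to a probability distribution, row by row."""
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(p_teacher, p_student, eps=1e-12):
    """Sum over the dataset D' of the cross-entropy between the teacher's
    distribution P_t(.|x) and the student's P_theta^s(.|x); one row per input x."""
    return -np.sum(p_teacher * np.log(p_student + eps))

# Toy setup: 3 inputs from D', each with a 4-way output distribution.
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(3, 4))
student_logits = rng.normal(size=(3, 4))

p_t = softmax(teacher_logits)   # teacher's output distributions P_t
p_s = softmax(student_logits)   # student's output distributions P_theta^s

loss = distillation_loss(p_t, p_s)
```

Training the student then amounts to adjusting the student's parameters θ (here, the values behind `student_logits`) to drive this loss down; by Gibbs' inequality the cross-entropy is minimized exactly when the student's distribution matches the teacher's.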

Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Distillation Loss for Response-Based Knowledge
Objective Function for Student Model Training via Knowledge Distillation
Definition of Teacher's Probability Distribution (Pt) in Knowledge Distillation
Definition of Student's Probability Distribution (P_theta^s)
General Loss Function for Knowledge Distillation
Optimizing a Language Model for Mobile Deployment
A research lab has developed a very large and complex language model that achieves state-of-the-art performance on a translation task. However, due to its size, the model is too slow and expensive to deploy for a real-time translation mobile app. To address this, the team uses the large model's predictions on a set of sentences to train a new, much smaller and faster model. What is the primary strategic advantage of this two-model approach?
A development team is using a knowledge distillation framework to create a compact, efficient language model (the 'student') from a much larger, high-performance model (the 'teacher'). The goal is to deploy the student model on devices with limited computational resources. Which statement best analyzes the typical relationship between the inputs processed by the teacher and student models during this process?
Objective Function for Student Model Training via Knowledge Distillation
A team is training a compact 'student' model to emulate a powerful 'teacher' model. The training objective is to minimize a loss function that measures the divergence between the probability distributions of the student model's outputs and the teacher model's outputs for a given set of inputs. What is the primary goal of this optimization process?
Evaluating Model Parameters via Distribution Matching
Consider an optimization process where a model's parameters are adjusted to minimize a loss function that measures the difference between the model's output distribution and a target distribution over a dataset D'. True or False: Increasing the size and diversity of the dataset D' will always guarantee a better match to the target distribution, resulting in a lower final loss value.
Learn After
Cross-Entropy Loss for Knowledge Distillation
Using KL Divergence for Knowledge Distillation Loss
A research team is training a small, efficient 'student' model to replicate the behavior of a large, powerful 'teacher' model. The team's goal is to find the optimal parameters for the student model (θ̂_s) by minimizing a loss function over a dataset of simplified inputs (D'), as defined by the following objective:
θ̂_s = argmin_θ Σ_{x ∈ D'} Loss(P_t(·|x), P_theta^s(·|x))
Where P_t is the teacher's output probability distribution and P_theta^s is the student's.
If the team mistakenly configures the training process to use the teacher's original, complex dataset instead of the intended simplified dataset D', which of the following outcomes is the most direct and likely consequence for the student model?
Critique of a Modified Training Objective
Diagnosing a Knowledge Distillation Training Issue