Learn Before
Computational Infeasibility of Full Output Summation in Distillation Loss
The direct application of the cross-entropy loss function for knowledge distillation is often computationally impractical. This is because the formula requires a summation over the entire set of possible output sequences, and the number of such sequences grows exponentially with output length, making exact computation infeasible in most real-world scenarios.
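To make the blow-up concrete, the sketch below enumerates every possible output sequence for a toy vocabulary. The vocabulary, sequence length, and realistic figures (vocabulary ~32,000, length ~100) are illustrative assumptions, not values from the card:

```python
import itertools
import math

# Toy setup (illustrative assumptions): a 3-token vocabulary and length-4 outputs.
vocab = ["a", "b", "c"]   # |V| = 3
seq_len = 4               # output sequences of length 4

# Enumerating every possible output sequence: there are |V| ** seq_len of them.
all_outputs = list(itertools.product(vocab, repeat=seq_len))
print(len(all_outputs))   # 3**4 = 81 -- exhaustive enumeration is still feasible here

# For a realistic model (say |V| = 32,000 and length 100), the count is
# 32,000 ** 100; its number of decimal digits alone shows the sum is infeasible.
print(round(math.log10(32_000) * 100), "decimal digits, roughly")
```

Even this tiny example makes the point: the full summation in the distillation loss requires one term per sequence, so its cost scales with |V|^T, not with |V| or T.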
Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Computational Infeasibility of Full Output Summation in Distillation Loss
A student model is trained to mimic a teacher model by minimizing the following loss function, which measures the dissimilarity between their output probability distributions for a given input:

Loss = -Σ_y Pr^t(y) log Pr^s(y)

In this formula, Pr^t(y) is the teacher's probability for an output sequence y, Pr^s(y) is the student's probability, and the summation is over all possible output sequences y. What is the primary function of the summation (Σ_y) over the entire space of possible outputs?
Evaluating a Loss Function for a Machine Translation Task
A student model is being trained to replicate the output distribution of a teacher model using the loss function:

Loss = -Σ_y Pr^t(y) log Pr^s(y)
Suppose for a given input, there are only three possible output sequences: A, B, and C. The teacher model assigns the following probabilities:
Pr^t(A) = 0.8, Pr^t(B) = 0.15, Pr^t(C) = 0.05
Two different student models produce the following distributions:
- Student 1: Pr^s(A) = 0.6, Pr^s(B) = 0.3, Pr^s(C) = 0.1
- Student 2: Pr^s(A) = 0.6, Pr^s(B) = 0.1, Pr^s(C) = 0.3
Without calculating the exact loss, which student model will achieve a lower loss value, and why?
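The intuition can be checked numerically. The sketch below computes the cross-entropy -Σ_y Pr^t(y) log Pr^s(y) for both students, assuming that standard form of the loss; the dictionary names are my own, and the probabilities come from the card above:

```python
import math

# Teacher and student distributions over the three possible outputs (from the card).
teacher = {"A": 0.8, "B": 0.15, "C": 0.05}
student1 = {"A": 0.6, "B": 0.3, "C": 0.1}
student2 = {"A": 0.6, "B": 0.1, "C": 0.3}

def distill_loss(teacher_probs, student_probs):
    """Cross-entropy between teacher and student: -sum_y Pr^t(y) * log Pr^s(y)."""
    return -sum(t * math.log(student_probs[y]) for y, t in teacher_probs.items())

loss1 = distill_loss(teacher, student1)
loss2 = distill_loss(teacher, student2)
print(f"Student 1 loss: {loss1:.4f}")  # approx. 0.704
print(f"Student 2 loss: {loss2:.4f}")  # approx. 0.814
```

Student 1 achieves the lower loss: both students agree on A, but Student 1 puts its remaining mass on B, which the teacher weights three times more heavily than C, so the heavily weighted log-probability terms are penalized less.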