Short Answer

Rationale for Using Logits in Distillation Loss

In the context of transferring knowledge from a large 'teacher' model to a smaller 'student' model, why is it often more effective to calculate the divergence loss directly on the raw, pre-activation output vectors (logits) of the two models, rather than on the final probability distributions produced after applying a standard activation function (like softmax)?
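For concreteness, here is a minimal PyTorch sketch contrasting the two formulations the question refers to. The framework choice, function names, and temperature value are illustrative assumptions, not part of the original question: a KL divergence on temperature-softened probability distributions (computed from the logits via log_softmax for numerical stability), versus a mean-squared-error loss applied directly to the raw logits.

```python
import torch
import torch.nn.functional as F

def distillation_losses(student_logits, teacher_logits, T=2.0):
    """Two common ways to match a student to a teacher (sketch)."""
    # (1) Loss on probability distributions: KL divergence between
    # temperature-softened softmax outputs. Working from log_softmax
    # of the logits avoids underflow in the probabilities.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # conventional rescaling of the soft-target gradients

    # (2) Loss directly on the raw logits: mean squared error, which
    # preserves the teacher's relative class scores that softmax
    # would otherwise squash toward 0 or 1.
    mse = F.mse_loss(student_logits, teacher_logits)
    return kl, mse

# Hypothetical usage with random logits (8 examples, 10 classes):
teacher_logits = torch.randn(8, 10)
student_logits = torch.randn(8, 10, requires_grad=True)
kl_loss, mse_loss = distillation_losses(student_logits, teacher_logits)
```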


Updated 2025-10-04

Tags

Deep Learning (in Machine Learning)

Data Science

Ch.3 Prompting - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science