A large, complex language model is used to generate target probabilities for training a smaller, more efficient model. For the input sentence 'The cat sat on the ___', the large model could produce different probability distributions for the next word. Which of the following distributions would provide the most informative and nuanced training signal for the smaller model?
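The question turns on why a spread-out ("soft") teacher distribution is a richer training signal than a near one-hot ("hard") one. A minimal sketch, using hypothetical toy probabilities over an illustrative four-word vocabulary: the soft distribution has higher Shannon entropy, meaning it encodes the relative plausibility of the alternatives rather than just the single top word.

```python
import math

# Toy next-word vocabulary for "The cat sat on the ___" (illustrative only).
vocab = ["mat", "rug", "floor", "sky"]

# A near one-hot ("hard") target vs. a soft teacher distribution
# (both hypothetical numbers, chosen for illustration).
hard = [0.97, 0.01, 0.01, 0.01]
soft = [0.60, 0.25, 0.13, 0.02]

def entropy(p):
    # Shannon entropy in bits: higher entropy means the distribution
    # carries more information about plausible alternatives.
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# The soft distribution tells the student not only the best word,
# but also how plausible the runner-up words are.
print(entropy(hard) < entropy(soft))  # True
```

The distribution with meaningful probability mass on several plausible continuations (e.g. 'mat', 'rug', 'floor') is the one that transfers the teacher's "dark knowledge" about word similarity to the student.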
Ch.3 Prompting - Foundations of Large Language Models
Value of the Teacher's Probability Distribution
In a knowledge distillation process for a machine translation task, a large 'teacher' model translates the sentence 'Je suis content' ('I am happy') from French to English. Instead of outputting only 'I am happy', the teacher model produces a full probability distribution over the entire English vocabulary for each next word. Which statement best analyzes the significance of this probability distribution for training the smaller 'student' model?
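The significance of the full distribution is that the student can be trained to match it directly, typically with a KL-divergence loss over temperature-softened softmax outputs. A minimal sketch under assumed toy logits (the vocabulary, logit values, and temperature are all hypothetical, not from the source):

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature > 1 softens the distribution, exposing the teacher's
    # relative preferences among plausible alternatives.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p_teacher, q_student, eps=1e-12):
    # KL(P || Q): penalizes the student wherever it underweights
    # words the teacher considers plausible.
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(p_teacher, q_student))

# Hypothetical logits for the next English word over a toy vocabulary
# ["happy", "glad", "content", "sad"].
teacher_logits = [4.0, 3.2, 2.5, -2.0]
student_logits = [3.0, 1.0, 1.0, 0.0]

T = 2.0  # assumed distillation temperature
soft_targets = softmax(teacher_logits, temperature=T)
student_probs = softmax(student_logits, temperature=T)
loss = kl_divergence(soft_targets, student_probs)  # scalar to minimize
```

Minimizing this loss pushes the student to reproduce not just the teacher's top choice ('happy') but its graded preferences over near-synonyms like 'glad' and 'content', which is exactly what a one-hot label cannot convey.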