Definition of Teacher's Probability Distribution (Pt) in Knowledge Distillation
In the context of knowledge distillation, $P_t$ represents the teacher model's output probability distribution. It is formally defined as a conditional probability, $P_t(y \mid x, z)$, which gives the probability of an output $y$ given a context $x$ and a latent variable $z$.
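To make the definition concrete, the following is a minimal sketch of how a teacher distribution of this kind can be computed from the teacher's logits and then consumed by a KL-divergence distillation loss (see the related "KL Divergence Loss for Knowledge Distillation" card below). It assumes PyTorch; the tensor names, vocabulary size, and temperature are illustrative placeholders, not values from this note.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 8

# Hypothetical next-token logits produced by the teacher and the student
# models for the same context x (placeholder values for illustration).
teacher_logits = torch.randn(vocab_size)
student_logits = torch.randn(vocab_size)

temperature = 2.0  # optional softening; T = 1 recovers the plain softmax

# P_t(y | x): the teacher's output distribution over the vocabulary.
p_t = F.softmax(teacher_logits / temperature, dim=-1)

# P^s_theta(y | x): the student's output distribution (log-probabilities).
log_p_s = F.log_softmax(student_logits / temperature, dim=-1)

# KL-divergence distillation loss, KL(P_t || P^s_theta): the student is
# pushed to match the teacher's full distribution over the vocabulary,
# not just the teacher's single most likely token.
kd_loss = F.kl_div(log_p_s, p_t, reduction="sum")

print("teacher distribution P_t:", p_t)
print("distillation loss:", kd_loss.item())
```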
Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Distillation Loss for Response-Based Knowledge
Objective Function for Student Model Training via Knowledge Distillation
Definition of Teacher's Probability Distribution (Pt) in Knowledge Distillation
Definition of Student's Probability Distribution (P_theta^s)
General Loss Function for Knowledge Distillation
Optimizing a Language Model for Mobile Deployment
A research lab has developed a very large and complex language model that achieves state-of-the-art performance on a translation task. However, due to its size, the model is too slow and expensive to deploy for a real-time translation mobile app. To address this, the team uses the large model's predictions on a set of sentences to train a new, much smaller and faster model. What is the primary strategic advantage of this two-model approach?
A development team is using a knowledge distillation framework to create a compact, efficient language model (the 'student') from a much larger, high-performance model (the 'teacher'). The goal is to deploy the student model on devices with limited computational resources. Which statement best analyzes the typical relationship between the inputs processed by the teacher and student models during this process?
Learn After
KL Divergence Loss for Knowledge Distillation
Cross-Entropy Loss for Knowledge Distillation
A large, complex language model is used to generate target probabilities for training a smaller, more efficient model. For the input sentence 'The cat sat on the ___', the large model could produce different probability distributions for the next word. Which of the following distributions, representing $P_t$, would provide the most informative and nuanced training signal for the smaller model?
Value of the Teacher's Probability Distribution
In a knowledge distillation process for a machine translation task, a large 'teacher' model translates the sentence 'Je suis content' from French to English. Instead of just outputting 'I am happy', the teacher model produces a full probability distribution over the entire English vocabulary for the next words. Which statement best analyzes the significance of this probability distribution ($P_t$) for training the smaller 'student' model?