Language Model Training Step Analysis
A decoder-only language model is being trained. At one particular step, it must predict the next token after processing the input sequence 'A cat sat on'. The model's entire vocabulary is ['A', 'cat', 'sat', 'on', 'the', 'mat', '.']. The full training example is the sentence 'A cat sat on the mat.'. Given this information, what is the specific ground-truth target distribution that the model's output will be compared against to calculate the cross-entropy loss for this step? Explain your reasoning.
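Since the full training sentence is 'A cat sat on the mat.', the token that follows the input 'A cat sat on' is 'the', so the ground-truth target is a one-hot distribution over the seven-token vocabulary: probability 1 on 'the' and 0 everywhere else. The sketch below works this through in plain Python; the predicted probabilities are made up for illustration and are not part of the question.

```python
import math

# Vocabulary from the question, in order.
vocab = ['A', 'cat', 'sat', 'on', 'the', 'mat', '.']

# After the input 'A cat sat on', the next token in the training
# sentence 'A cat sat on the mat.' is 'the', so the ground-truth
# target puts all probability mass on 'the'.
target = [1.0 if tok == 'the' else 0.0 for tok in vocab]
print(target)  # [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]

# Hypothetical model output (softmax probabilities), for illustration only.
predicted = [0.05, 0.05, 0.05, 0.10, 0.60, 0.10, 0.05]

# Cross-entropy against a one-hot target reduces to the negative log
# probability the model assigned to the correct token, 'the'.
loss = -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)
print(f"cross-entropy loss: {loss:.4f}")  # -log(0.60) ≈ 0.5108
```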
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Total Loss Calculation for a Token Sequence
An auto-regressive language model is being trained on the text sequence: 'The quick brown fox jumps'. At the training step where the model has processed the input 'The quick brown fox', what two quantities are compared by the cross-entropy loss function to calculate the error signal for updating the model's parameters?
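For this related card, the two quantities compared at each step are the model's predicted next-token distribution and the one-hot ground-truth distribution for the actual next token. A minimal sketch of how those per-step comparisons sum into a sequence loss follows; the per-token probabilities are invented for illustration and are not values from the card.

```python
import math

tokens = ['The', 'quick', 'brown', 'fox', 'jumps']

# Hypothetical probabilities the model assigns to each correct next token
# (after 'The' -> 'quick', after 'The quick' -> 'brown', and so on).
p_correct = [0.40, 0.55, 0.70, 0.60]

# At each step the loss compares the predicted distribution over the
# vocabulary with a one-hot target on the actual next token; with a
# one-hot target this reduces to -log p(correct token).
step_losses = [-math.log(p) for p in p_correct]
total_loss = sum(step_losses)
print(f"per-step losses: {[round(l, 3) for l in step_losses]}")
print(f"total sequence loss: {total_loss:.4f}")  # ≈ 2.3816
```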
Language Model Training Step Analysis
An auto-regressive language model is being trained on a large text corpus. At one training step, the model processes the input 'The cat sat on the' and must predict the next token. The actual next token in the training data is 'mat'. Which of the following predicted probability distributions for the next token would result in the lowest cross-entropy loss?
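The answer options for this card are not reproduced here, but the principle can be checked numerically: with a one-hot target on 'mat', cross-entropy is -log p('mat'), so whichever candidate distribution assigns the highest probability to 'mat' yields the lowest loss. The candidate distributions below are hypothetical stand-ins, not the original options.

```python
import math

# The actual next token after 'The cat sat on the' in the training data.
target_token = 'mat'

# Hypothetical candidate predictions (token -> probability); illustrative
# stand-ins, not the answer options from the original card.
candidates = {
    'peaked on mat':    {'mat': 0.90, 'rug': 0.05, 'dog': 0.05},
    'near uniform':     {'mat': 0.34, 'rug': 0.33, 'dog': 0.33},
    'peaked elsewhere': {'mat': 0.05, 'rug': 0.90, 'dog': 0.05},
}

# With a one-hot target, cross-entropy is -log(p(target_token)), so the
# distribution giving 'mat' the most probability has the lowest loss.
for name, dist in candidates.items():
    loss = -math.log(dist[target_token])
    print(f"{name:>16}: loss = {loss:.4f}")
```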