Equivalence of Training Objectives
An auto-regressive language model is trained to predict the next token in a sequence. The training objective is to minimize the cross-entropy loss between the model's predicted probability distribution and the ground-truth distribution, which is a one-hot vector placing all probability mass on the true next token. Explain mathematically why minimizing this cross-entropy loss for a single token prediction is equivalent to maximizing the log-likelihood of that true token.
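The equivalence the question asks for can be sketched in a few lines, writing $q$ for the model's predicted distribution and $y$ for the index of the true next token:

```latex
Let $p$ be the one-hot target ($p_y = 1$, $p_i = 0$ for $i \neq y$) and $q$ the
model's predicted distribution. The cross-entropy loss is
\[
  \mathcal{L}_{\mathrm{CE}} = -\sum_{i} p_i \log q_i = -\log q_y ,
\]
since every term with $i \neq y$ vanishes. Therefore
\[
  \arg\min_q \mathcal{L}_{\mathrm{CE}}
  = \arg\min_q \bigl(-\log q_y\bigr)
  = \arg\max_q \log q_y ,
\]
i.e.\ minimizing the cross-entropy for a single prediction is exactly
maximizing the log-likelihood of the true token.
```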
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A machine learning engineer is training a language model on a text corpus. During training, they plot two values at each step:
- The average negative log-likelihood of the target sequences.
- The cross-entropy loss between the model's predicted probability distributions and the one-hot encoded target tokens.
The engineer observes that the two plots are identical. Which of the following statements provides the most accurate mathematical justification for this observation?
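A minimal numerical sketch of the identity behind the engineer's observation, using toy probabilities over a hypothetical 4-token vocabulary (not any actual model's output):

```python
import math

# Hypothetical predicted distribution over a 4-token vocabulary,
# and the index of the true next token.
probs = [0.1, 0.6, 0.2, 0.1]
target = 1  # index of the true token

# One-hot encoding of the target distribution.
one_hot = [1.0 if i == target else 0.0 for i in range(len(probs))]

# Cross-entropy with the one-hot target: -sum_i p_i * log(q_i).
# Every term with p_i = 0 drops out of the sum.
cross_entropy = -sum(p * math.log(q) for p, q in zip(one_hot, probs))

# Negative log-likelihood of the true token under the model.
nll = -math.log(probs[target])

print(cross_entropy, nll)  # the two values coincide
```

Because the one-hot vector zeroes out every term except the true token's, the two quantities are the same number at every training step, which is why the two plots overlap exactly.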
Equivalence of Training Objectives
True or False: The mathematical equivalence between minimizing cross-entropy loss and maximizing the auto-regressive log-likelihood for a target sequence holds true regardless of how the ground-truth labels are represented (e.g., one-hot vectors vs. smoothed probability distributions).
Comparing Language Model Training Objectives