Language Model Training Step Analysis
A decoder-only language model is being trained. At one particular step, it must predict the next token after processing the input sequence 'A cat sat on'. The model's entire vocabulary is ['A', 'cat', 'sat', 'on', 'the', 'mat', '.']. The full training example is the sentence 'A cat sat on the mat.'. Given this information, what is the specific ground-truth target distribution that the model's output will be compared against to calculate the cross-entropy loss for this step? Explain your reasoning.
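Since the full training sentence is 'A cat sat on the mat.', the token that follows the input 'A cat sat on' is 'the', so the ground-truth target is a one-hot distribution over the seven-token vocabulary: probability 1 on 'the' and 0 everywhere else. The sketch below works this through in plain Python; the predicted probabilities are made up for illustration and are not part of the question.

```python
import math

# Vocabulary from the question, in order.
vocab = ['A', 'cat', 'sat', 'on', 'the', 'mat', '.']

# After the input 'A cat sat on', the next token in the training
# sentence 'A cat sat on the mat.' is 'the', so the ground-truth
# target puts all probability mass on 'the'.
target = [1.0 if tok == 'the' else 0.0 for tok in vocab]
print(target)  # [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]

# Hypothetical model output (softmax probabilities), for illustration only.
predicted = [0.05, 0.05, 0.05, 0.10, 0.60, 0.10, 0.05]

# Cross-entropy against a one-hot target reduces to the negative log
# probability the model assigned to the correct token, 'the'.
loss = -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)
print(f"cross-entropy loss: {loss:.4f}")  # -log(0.60) ≈ 0.5108
```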
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Total Loss Calculation for a Token Sequence
An auto-regressive language model is being trained on the text sequence: 'The quick brown fox jumps'. At the training step where the model has processed the input 'The quick brown fox', what two quantities are compared by the cross-entropy loss function to calculate the error signal for updating the model's parameters?
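For this related card, the two quantities compared at each step are the model's predicted next-token distribution and the one-hot ground-truth distribution for the actual next token. A minimal sketch of how those per-step comparisons sum into a sequence loss follows; the per-token probabilities are invented for illustration and are not values from the card.

```python
import math

tokens = ['The', 'quick', 'brown', 'fox', 'jumps']

# Hypothetical probabilities the model assigns to each correct next token
# (after 'The' -> 'quick', after 'The quick' -> 'brown', and so on).
p_correct = [0.40, 0.55, 0.70, 0.60]

# At each step the loss compares the predicted distribution over the
# vocabulary with a one-hot target on the actual next token; with a
# one-hot target this reduces to -log p(correct token).
step_losses = [-math.log(p) for p in p_correct]
total_loss = sum(step_losses)
print(f"per-step losses: {[round(l, 3) for l in step_losses]}")
print(f"total sequence loss: {total_loss:.4f}")  # ≈ 2.3816
```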
Language Model Training Step Analysis
An auto-regressive language model is being trained on a large text corpus. At one training step, the model processes the input 'The cat sat on the' and must predict the next token. The actual next token in the training data is 'mat'. Which of the following predicted probability distributions for the next token would result in the lowest cross-entropy loss?
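The answer options for this card are not reproduced here, but the principle can be checked numerically: with a one-hot target on 'mat', cross-entropy is -log p('mat'), so whichever candidate distribution assigns the highest probability to 'mat' yields the lowest loss. The candidate distributions below are hypothetical stand-ins, not the original options.

```python
import math

# The actual next token after 'The cat sat on the' in the training data.
target_token = 'mat'

# Hypothetical candidate predictions (token -> probability); illustrative
# stand-ins, not the answer options from the original card.
candidates = {
    'peaked on mat':    {'mat': 0.90, 'rug': 0.05, 'dog': 0.05},
    'near uniform':     {'mat': 0.34, 'rug': 0.33, 'dog': 0.33},
    'peaked elsewhere': {'mat': 0.05, 'rug': 0.90, 'dog': 0.05},
}

# With a one-hot target, cross-entropy is -log(p(target_token)), so the
# distribution giving 'mat' the most probability has the lowest loss.
for name, dist in candidates.items():
    loss = -math.log(dist[target_token])
    print(f"{name:>16}: loss = {loss:.4f}")
```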