Case Study

Language Model Training Step Analysis

A decoder-only language model is being trained. At one training step, it must predict the next token after processing the input sequence 'A cat sat on'. The model's entire vocabulary is ['A', 'cat', 'sat', 'on', 'the', 'mat', '.'], and the full training example is the sentence 'A cat sat on the mat.'. Given this information, what is the specific ground-truth target distribution that the model's output will be compared against when computing the cross-entropy loss for this step? Explain your reasoning.
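The scenario above can be sketched in code. This is a minimal illustration, not an official answer key: the ground-truth target is read off the full training sentence ('the' follows 'on'), and the model's softmax output shown here is a hypothetical probability vector invented purely so the cross-entropy computation has something to compare against.

```python
import math

# Vocabulary and training example as given in the case study.
vocab = ['A', 'cat', 'sat', 'on', 'the', 'mat', '.']
context = ['A', 'cat', 'sat', 'on']
full_sentence = ['A', 'cat', 'sat', 'on', 'the', 'mat', '.']

# The ground-truth next token is whatever follows the context
# in the full training sentence: here, 'the'.
target_token = full_sentence[len(context)]

# One-hot target distribution over the vocabulary:
# probability 1.0 on the true next token, 0.0 everywhere else.
target = [1.0 if tok == target_token else 0.0 for tok in vocab]

# Hypothetical model output (post-softmax probabilities) for illustration.
predicted = [0.05, 0.05, 0.05, 0.10, 0.60, 0.10, 0.05]

# With a one-hot target, cross-entropy reduces to the negative log
# of the probability the model assigned to the true token.
loss = -sum(t * math.log(p) for t, p in zip(target, predicted))
```

Because the target is one-hot, only the probability on 'the' contributes to the loss; here it evaluates to -log(0.60).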


Updated 2025-10-03


Tags: Ch.1 Pre-training - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences, Analysis in Bloom's Taxonomy, Cognitive Psychology, Psychology, Social Science, Empirical Science, Science