
Token-Level Loss Calculation in a Backward Pass

When training an autoregressive language model, the loss is calculated by comparing the model's predictions to the actual target tokens. A key aspect of this process is that the loss is computed only for the output, or target, portion of the sequence. For an input sequence x1, x2, x3 with a target output y1, y2, the loss at the input-token positions is zero. Consequently, the gradients used to update the model's weights originate only from the positions of the target tokens (y1, y2), since those are the only positions with a non-zero loss. During the backward pass, these gradients are propagated backward through the network to adjust the parameters.
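As a concrete illustration, here is a minimal sketch assuming PyTorch. A common way to implement this is to replace the labels at input positions with a sentinel value that the loss function ignores, so only target positions produce loss and, therefore, gradients. The sequence layout, tensor shapes, and token values below are illustrative assumptions, not from the original text; the -100 sentinel is the default ignore_index of PyTorch's cross_entropy.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size = 10
# Full sequence: inputs x1, x2, x3 followed by targets y1, y2.
# (The usual one-position shift for next-token prediction is
# omitted here to keep the example minimal.)
labels = torch.tensor([[4, 7, 2, 5, 8]])                  # (batch=1, seq=5)
logits = torch.randn(1, 5, vocab_size, requires_grad=True)

# Replace input-position labels with the sentinel that
# cross_entropy ignores; those positions contribute zero loss.
masked_labels = labels.clone()
masked_labels[:, :3] = -100                               # x1, x2, x3 -> no loss

loss = F.cross_entropy(
    logits.view(-1, vocab_size),
    masked_labels.view(-1),
    ignore_index=-100,
)
loss.backward()

# Gradients are non-zero only at the target positions y1, y2.
print(logits.grad[0].abs().sum(dim=-1))  # first three entries are 0
```

Running this shows that the gradient with respect to the logits vanishes at the masked input positions, matching the description above: only the target positions drive the weight update.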
