Formula

Selective Loss Computation in Joint Probability Language Modeling

In language model training that targets the joint probability of a concatenated sequence $\text{seq}_{\mathbf{x},\mathbf{y}}$, the log-probability is decomposed using the chain rule:

$$
\log \text{Pr}_{\theta}(\text{seq}_{\mathbf{x},\mathbf{y}}) = \log \text{Pr}_{\theta}(\mathbf{x}, \mathbf{y}) = \underbrace{\log \text{Pr}_{\theta}(\mathbf{x})}_{\text{set to 0}} + \underbrace{\log \text{Pr}_{\theta}(\mathbf{y} \mid \mathbf{x})}_{\text{loss computation}}
$$

For practical training, the loss is computed only on the conditional term $\log \text{Pr}_{\theta}(\mathbf{y} \mid \mathbf{x})$, which corresponds to the output tokens $\mathbf{y}$. The contribution of the marginal term $\log \text{Pr}_{\theta}(\mathbf{x})$, which corresponds to the input tokens $\mathbf{x}$, is effectively set to zero.
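As a concrete illustration, here is a minimal PyTorch sketch of this selective loss, assuming a decoder-only model whose logits have shape `(batch, seq_len, vocab_size)` for the concatenated sequence $[\mathbf{x}; \mathbf{y}]$; the function name `selective_loss` and the `prompt_len` argument are illustrative, not from the original text. The input tokens $\mathbf{x}$ are excluded by setting their label positions to `-100`, the default `ignore_index` of `torch.nn.functional.cross_entropy`:

```python
import torch
import torch.nn.functional as F

def selective_loss(logits: torch.Tensor,
                   input_ids: torch.Tensor,
                   prompt_len: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over the output tokens y only.

    logits:     (batch, seq_len, vocab_size) model outputs for seq_{x,y}
    input_ids:  (batch, seq_len) the concatenated sequence [x; y]
    prompt_len: (batch,) number of input tokens x in each example
    """
    # Standard next-token shift: the logits at position t predict token t+1.
    shift_logits = logits[:, :-1, :]
    labels = input_ids[:, 1:].clone()

    # Position t in `labels` holds token t+1 of the sequence; tokens with
    # index < prompt_len belong to x, so their loss is "set to 0" by
    # replacing the label with the ignore_index.
    batch, seq_len = labels.shape
    positions = torch.arange(seq_len, device=labels.device).unsqueeze(0)
    mask = positions + 1 < prompt_len.unsqueeze(1)
    labels[mask] = -100  # ignored by F.cross_entropy

    # Only the conditional term log Pr_theta(y | x) contributes to the loss.
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
```

With the default `reduction='mean'`, positions carrying `ignore_index` are excluded from the average as well, so the loss is normalized over the output tokens $\mathbf{y}$ alone.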

