Selective Loss Computation in Joint Probability Language Modeling
In language model training that targets the joint probability of a concatenated sequence [x, y], the log-probability is decomposed using the chain rule:

log Pr_θ([x, y]) = log Pr_θ(x) + log Pr_θ(y|x)

For practical training, the loss is computed only on the conditional probability term, log Pr_θ(y|x), which corresponds to the output tokens y. The loss contribution from the marginal probability term, log Pr_θ(x), which corresponds to the input tokens x, is masked out, i.e. effectively set to zero.
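As a minimal sketch of this selective loss (the function name, toy log-probabilities, and mask layout are illustrative, not from the source), per-token log-probabilities can be summed only where a loss mask marks output tokens:

```python
def selective_nll(token_logprobs, loss_mask):
    """Negative log-likelihood summed only over positions where
    loss_mask == 1 (the output tokens y); input tokens x contribute 0."""
    assert len(token_logprobs) == len(loss_mask)
    return -sum(lp for lp, m in zip(token_logprobs, loss_mask) if m)

# positions:      x1     x2     y1     y2
logprobs = [-1.0, -2.0, -0.5, -0.25]
mask     = [0,    0,    1,    1]      # zero out the loss on input tokens x
loss = selective_nll(logprobs, mask)  # -(-0.5 + -0.25) = 0.75 = -log Pr_θ(y|x)
```

In a framework such as PyTorch, the same effect is commonly achieved by setting the labels of input-token positions to the cross-entropy loss's ignore index so those positions are skipped.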

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
SFT as Language Model Training on Concatenated Sequences
Calculating Conditional Log-Probability Using an LLM
Selective Loss Computation in Joint Probability Language Modeling
Calculating Conditional Log-Probability
An engineer is evaluating a language model and calculates the following log-probabilities for an input sequence x and an output sequence y: the joint log-probability log Pr([x, y]) and the marginal log-probability log Pr(x). They observe that the value of log Pr([x, y]) is significantly more negative than the value of log Pr(x). Based on the fundamental relationship between joint, conditional, and marginal probabilities, what is the most accurate conclusion?

A language model is being evaluated. For a given input sequence x and a potential output sequence y, the model calculates log Pr([x, y]) = -3.5 and log Pr(x) = -5.2. Based on these values, it is reasonable to conclude that the model's probability calculations are functioning correctly.
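The arithmetic behind that scenario can be checked directly with the chain rule; this short sketch uses the numbers from the question above (the variable names are illustrative):

```python
# By the chain rule: log Pr(y|x) = log Pr([x, y]) - log Pr(x)
log_joint = -3.5
log_marginal = -5.2
log_conditional = log_joint - log_marginal  # 1.7, i.e. Pr(y|x) > 1

# A valid probability satisfies Pr(y|x) <= 1, so log Pr(y|x) <= 0,
# which forces log Pr([x, y]) <= log Pr(x). These values violate that.
valid = log_conditional <= 0
```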
Learn After
A language model is being trained on instruction-following data. For one specific training instance, the model processes the full tokenized sequence: ['User:', 'What', 'is', '2+2?', 'Assistant:', '4']. The goal is to train the model to provide the correct response ('4') when given the user's prompt. During the backpropagation step for this single instance, on which token(s) is the predictive loss calculated to update the model's weights?

Diagnosing a Faulty Language Model Training Process
A machine learning engineer is training a language model for a question-answering task. The training data consists of concatenated [question, answer] sequences. Due to a configuration error, the training loss is calculated across all tokens in the sequence (both question and answer), instead of only on the answer tokens. What is the most likely and significant negative consequence of this misconfiguration on the model's behavior?

Loss Masking via Forward and Backward Passes in SFT
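As a concrete illustration of how a loss mask might be built for an instruction-following sequence like the one above (the helper name and the 'Assistant:' marker convention are assumptions for this sketch, not from the source):

```python
def build_loss_mask(tokens, response_marker='Assistant:'):
    """Return a per-token mask: 1 for tokens after the response marker
    (the answer the model should learn), 0 for all prompt tokens."""
    start = tokens.index(response_marker) + 1
    return [0] * start + [1] * (len(tokens) - start)

tokens = ['User:', 'What', 'is', '2+2?', 'Assistant:', '4']
mask = build_loss_mask(tokens)
# Only positions with mask == 1 contribute to the loss, so gradients
# during the backward pass flow only from predicting the response tokens.
```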