Comparison

Conditional vs. Joint Probability Objectives in Language Modeling

A fundamental difference exists between standard language modeling and supervised fine-tuning objectives. Standard language modeling minimizes the loss over all tokens of a concatenated input-output sequence $\mathrm{seq}_{\mathbf{x},\mathbf{y}} = [\mathbf{x},\mathbf{y}]$, optimizing the joint log-probability $\log \mathrm{Pr}_{\theta}(\mathbf{x},\mathbf{y})$. In contrast, fine-tuning optimizes the conditional log-probability of the output given the input. By the chain rule, the joint sequence log-probability decomposes into the log-probability of the input plus the conditional log-probability of the output: $\log \mathrm{Pr}_{\theta}(\mathrm{seq}_{\mathbf{x},\mathbf{y}}) = \log \mathrm{Pr}_{\theta}(\mathbf{x}) + \log \mathrm{Pr}_{\theta}(\mathbf{y} \mid \mathbf{x})$. In the fine-tuning setting, the per-token loss over the input $\mathbf{x}$ is masked (weighted by $0$), so the loss is computed exclusively over the output tokens $\mathbf{y}$.
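A minimal sketch of this relationship, using hypothetical per-token log-probabilities rather than a real model: the joint objective sums over every token of $[\mathbf{x},\mathbf{y}]$, while the fine-tuning objective zero-weights the input tokens, and the chain-rule identity ties the two together.

```python
import math

# Hypothetical per-token log-probabilities (illustrative values only,
# not produced by an actual language model).
logprobs_x = [-2.1, -1.5, -0.9]   # log Pr_theta(x_t | x_<t), input tokens
logprobs_y = [-0.7, -1.2]         # log Pr_theta(y_t | x, y_<t), output tokens

# Standard language modeling: the joint objective over the whole
# concatenated sequence [x, y].
joint_logprob = sum(logprobs_x) + sum(logprobs_y)

# Supervised fine-tuning: weight the loss on input tokens by 0, so only
# the output tokens y contribute. This yields log Pr_theta(y | x).
weights = [0.0] * len(logprobs_x) + [1.0] * len(logprobs_y)
token_logprobs = logprobs_x + logprobs_y
conditional_logprob = sum(w * lp for w, lp in zip(weights, token_logprobs))

# Chain rule: log Pr(seq) = log Pr(x) + log Pr(y | x)
assert math.isclose(joint_logprob, sum(logprobs_x) + conditional_logprob)
```

In practice, frameworks implement the same masking by excluding input positions from the loss (e.g., via an ignore index on the labels) rather than multiplying by explicit weights, but the effect is identical.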

Updated 2026-05-02
