Comparison

Conditional vs. Joint Probability Objectives in Language Modeling

A fundamental difference exists between standard language modeling and supervised fine-tuning objectives. Standard language modeling minimizes the loss over all tokens of a concatenated input-output sequence $\mathrm{seq}_{\mathbf{x},\mathbf{y}} = [\mathbf{x},\mathbf{y}]$, optimizing the joint log-probability $\log \mathrm{Pr}_{\theta}(\mathbf{x},\mathbf{y})$. In contrast, fine-tuning optimizes the conditional log-probability of the output given the input. By the chain rule, the joint sequence log-probability decomposes into the log-probability of the input plus the conditional log-probability of the output: $\log \mathrm{Pr}_{\theta}(\mathrm{seq}_{\mathbf{x},\mathbf{y}}) = \log \mathrm{Pr}_{\theta}(\mathbf{x}) + \log \mathrm{Pr}_{\theta}(\mathbf{y} \mid \mathbf{x})$. In the fine-tuning setting, the per-token loss over the input $\mathbf{x}$ is masked (weighted by $0$), so the loss is computed exclusively over the output tokens $\mathbf{y}$.
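A minimal sketch of this relationship, using hypothetical per-token log-probabilities rather than a real model: the joint objective sums over every token of $[\mathbf{x},\mathbf{y}]$, while the fine-tuning objective zero-weights the input tokens, and the chain-rule identity ties the two together.

```python
import math

# Hypothetical per-token log-probabilities (illustrative values only,
# not produced by an actual language model).
logprobs_x = [-2.1, -1.5, -0.9]   # log Pr_theta(x_t | x_<t), input tokens
logprobs_y = [-0.7, -1.2]         # log Pr_theta(y_t | x, y_<t), output tokens

# Standard language modeling: the joint objective over the whole
# concatenated sequence [x, y].
joint_logprob = sum(logprobs_x) + sum(logprobs_y)

# Supervised fine-tuning: weight the loss on input tokens by 0, so only
# the output tokens y contribute. This yields log Pr_theta(y | x).
weights = [0.0] * len(logprobs_x) + [1.0] * len(logprobs_y)
token_logprobs = logprobs_x + logprobs_y
conditional_logprob = sum(w * lp for w, lp in zip(weights, token_logprobs))

# Chain rule: log Pr(seq) = log Pr(x) + log Pr(y | x)
assert math.isclose(joint_logprob, sum(logprobs_x) + conditional_logprob)
```

In practice, frameworks implement the same masking by excluding input positions from the loss (e.g., via an ignore index on the labels) rather than multiplying by explicit weights, but the effect is identical.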

Updated 2026-05-02
