Formula

Log-Probability Decomposition for Efficient Multi-Turn Dialogue Training

To efficiently train a model on multi-turn dialogues in a single pass, the entire alternating conversation is treated as one concatenated sequence, $\mathrm{seq} = [\mathbf{x}^1, \mathbf{y}^1, \dots, \mathbf{x}^K, \mathbf{y}^K]$. Its overall log-probability decomposes, via the chain rule, into one conditional term per turn. A key trick in supervised fine-tuning (SFT) for conversational models is that the loss is computed exclusively on the model's responses, while the terms for generating the user's inputs are masked to $0$. The decomposed log-probability is:

$$\log \mathrm{Pr}_{\theta}(\mathrm{seq}) = \log \mathrm{Pr}_{\theta}(\mathbf{x}^1) + \log \mathrm{Pr}_{\theta}(\mathbf{y}^1 \mid \mathbf{x}^1) + \dots + \log \mathrm{Pr}_{\theta}(\mathbf{x}^K \mid \mathbf{x}^1, \mathbf{y}^1, \dots, \mathbf{y}^{K-1}) + \log \mathrm{Pr}_{\theta}(\mathbf{y}^K \mid \mathbf{x}^1, \mathbf{y}^1, \dots, \mathbf{x}^K)$$

In this sum, terms predicting user inputs such as $\log \mathrm{Pr}_{\theta}(\mathbf{x}^1)$ are masked to $0$, and only terms predicting responses such as $\log \mathrm{Pr}_{\theta}(\mathbf{y}^k \mid \dots)$ contribute to the training loss.
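
This masking can be illustrated with a minimal PyTorch sketch (not from the source). It assumes each turn is already tokenized into ids, and it uses the label value -100, which `torch.nn.functional.cross_entropy` ignores, so that user-turn tokens drop out of the loss; the shift between logits and labels is the usual next-token alignment.

```python
# Minimal sketch of turn-level loss masking for multi-turn SFT.
# Assumption: each turn is pre-tokenized; roles are "user" or "assistant".
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # label value skipped by F.cross_entropy

def build_inputs_and_labels(turns):
    """turns: list of (role, token_ids). Returns concatenated ids and labels."""
    input_ids, labels = [], []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)                          # supervised: y^k terms
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))    # masked: x^k terms
    return torch.tensor(input_ids), torch.tensor(labels)

# Toy example with made-up token ids standing in for [x^1, y^1, x^2, y^2].
turns = [
    ("user",      [11, 12, 13]),
    ("assistant", [21, 22]),
    ("user",      [14, 15]),
    ("assistant", [23, 24, 25]),
]
input_ids, labels = build_inputs_and_labels(turns)

# Standard next-token shift: the logit at position t predicts the token at t+1,
# so only positions whose *target* is a response token contribute to the loss.
vocab_size = 32
logits = torch.randn(len(input_ids), vocab_size)  # stand-in for model output
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE_INDEX)
print(loss)
```

In a real training loop the same labels tensor is typically passed alongside the full concatenated input, so one forward pass covers all $K$ turns while the masked positions contribute nothing to the gradient.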
