Formula

Log-Probability Decomposition for Efficient Multi-Turn Dialogue Training

To efficiently train a model on multi-turn dialogues in a single pass, the entire alternating conversation is treated as one concatenated sequence, $\mathrm{seq} = [\mathbf{x}^1, \mathbf{y}^1, \dots, \mathbf{x}^K, \mathbf{y}^K]$. Its overall log-probability decomposes, via the chain rule, into one conditional term per turn. A key trick in supervised fine-tuning (SFT) for conversational models is that the loss is computed exclusively on the model's responses, while the terms for generating the user's inputs are masked to $0$. The decomposed log-probability is:

$$\log \mathrm{Pr}_{\theta}(\mathrm{seq}) = \log \mathrm{Pr}_{\theta}(\mathbf{x}^1) + \log \mathrm{Pr}_{\theta}(\mathbf{y}^1 \mid \mathbf{x}^1) + \dots + \log \mathrm{Pr}_{\theta}(\mathbf{x}^K \mid \mathbf{x}^1, \mathbf{y}^1, \dots, \mathbf{y}^{K-1}) + \log \mathrm{Pr}_{\theta}(\mathbf{y}^K \mid \mathbf{x}^1, \mathbf{y}^1, \dots, \mathbf{x}^K)$$

In this sum, terms predicting user inputs such as $\log \mathrm{Pr}_{\theta}(\mathbf{x}^1)$ are masked to $0$, and only terms predicting responses such as $\log \mathrm{Pr}_{\theta}(\mathbf{y}^k \mid \dots)$ contribute to the training loss.
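
This masking can be illustrated with a minimal PyTorch sketch (not from the source). It assumes each turn is already tokenized into ids, and it uses the label value -100, which `torch.nn.functional.cross_entropy` ignores, so that user-turn tokens drop out of the loss; the shift between logits and labels is the usual next-token alignment.

```python
# Minimal sketch of turn-level loss masking for multi-turn SFT.
# Assumption: each turn is pre-tokenized; roles are "user" or "assistant".
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # label value skipped by F.cross_entropy

def build_inputs_and_labels(turns):
    """turns: list of (role, token_ids). Returns concatenated ids and labels."""
    input_ids, labels = [], []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)                          # supervised: y^k terms
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))    # masked: x^k terms
    return torch.tensor(input_ids), torch.tensor(labels)

# Toy example with made-up token ids standing in for [x^1, y^1, x^2, y^2].
turns = [
    ("user",      [11, 12, 13]),
    ("assistant", [21, 22]),
    ("user",      [14, 15]),
    ("assistant", [23, 24, 25]),
]
input_ids, labels = build_inputs_and_labels(turns)

# Standard next-token shift: the logit at position t predicts the token at t+1,
# so only positions whose *target* is a response token contribute to the loss.
vocab_size = 32
logits = torch.randn(len(input_ids), vocab_size)  # stand-in for model output
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE_INDEX)
print(loss)
```

In a real training loop the same labels tensor is typically passed alongside the full concatenated input, so one forward pass covers all $K$ turns while the masked positions contribute nothing to the gradient.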
