Log-Probability Decomposition for Efficient Multi-Turn Dialogue Training
To train a model efficiently on a multi-turn dialogue in a single forward pass, the entire alternating conversation is treated as a single concatenated sequence, $[x^1, y^1, x^2, y^2, \dots]$, where $x^k$ is the user's input and $y^k$ is the model's response in turn $k$. Its overall log-probability decomposes into a sum of conditional log-probabilities, one per turn. A key trick in supervised fine-tuning (SFT) for conversational models is that the loss is computed exclusively on the model's responses, while the loss terms for generating the user's inputs are ignored (set to $0$). For a two-turn dialogue, the decomposed log-probability is: $\log p(x^1, y^1, x^2, y^2) = \log p(x^1) + \log p(y^1 \mid x^1) + \log p(x^2 \mid x^1, y^1) + \log p(y^2 \mid x^1, y^1, x^2)$. In this sum, terms predicting user inputs like $\log p(x^2 \mid x^1, y^1)$ are masked to $0$, and only terms predicting responses like $\log p(y^1 \mid x^1)$ and $\log p(y^2 \mid x^1, y^1, x^2)$ contribute to the training loss.
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Maximum Likelihood Estimation for Sequential Data
Fine-Tuning as Maximum Likelihood Estimation
Log-Probability Decomposition for Efficient Multi-Turn Dialogue Training
A language model is being trained on a dataset containing a mix of very short sequences and a few extremely long sequences. A developer observes that the overall training objective, which is the sum of the log-probabilities of all sequences in the dataset, seems to be disproportionately influenced by the model's performance on the few long sequences. Which of the following best explains this observation?
Model Parameter Selection via Likelihood
A language model is being trained on a large dataset of text sequences. After a single parameter update, the model's calculated log-probability for one specific sequence in the dataset increases by 2.5, while the log-probabilities for all other sequences in the dataset remain exactly the same. How does this change affect the overall maximum likelihood training objective for the entire dataset?
Standard Optimization Objective for Transformer Language Models
Log-Probability Decomposition for Efficient Multi-Turn Dialogue Training
An engineer is training a dialogue model on a dataset of conversations, each containing multiple turns. Their current training script processes each conversation by performing a separate forward pass for every model response. For a conversation with K responses, this results in K forward passes. This approach is proving to be computationally very slow. Based on common practices for training such models, which of the following strategies provides the most significant improvement in training efficiency?
A two-turn dialogue consists of a user's initial prompt (x^1), the model's response (y^1), the user's follow-up prompt (x^2), and the model's final response (y^2). To train a model efficiently in a single forward pass, these turns must be arranged into a single concatenated sequence. Arrange the following dialogue components into the correct sequence representation.
Analysis of a Dialogue Sequence Representation
Learn After
A dialogue model is trained by processing entire multi-turn conversations as single, concatenated sequences of text. To make this process efficient, the training loss is calculated based only on the model's ability to predict certain parts of the sequence, while the log-probabilities of other parts are ignored. Given the following two-turn conversation, which parts of the sequence would be used to calculate the training loss?
- Turn 1 (User): 'What is the weather like?'
- Turn 1 (Model): 'In which city?'
- Turn 2 (User): 'In London'
- Turn 2 (Model): 'It is currently raining.'
Debugging a Dialogue Model Training Loop
Evaluating Dialogue Model Training Strategies
Dataset-Level Objective for Multi-Round Conversational Models