Selective Loss Computation in Joint Probability Language Modeling
In language model training that targets the joint probability of a concatenated sequence [x, y], the log-probability is decomposed using the chain rule:

log Pr_θ([x, y]) = log Pr_θ(x) + log Pr_θ(y|x)

For practical training, the loss is computed only on the conditional probability term, log Pr_θ(y|x), which corresponds to the output tokens y. The loss contribution from the marginal probability term, log Pr_θ(x), which corresponds to the input tokens x, is masked out, i.e. effectively set to zero.
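As a minimal sketch of this selective loss (the function name, toy log-probabilities, and mask layout are illustrative, not from the source), per-token log-probabilities can be summed only where a loss mask marks output tokens:

```python
def selective_nll(token_logprobs, loss_mask):
    """Negative log-likelihood summed only over positions where
    loss_mask == 1 (the output tokens y); input tokens x contribute 0."""
    assert len(token_logprobs) == len(loss_mask)
    return -sum(lp for lp, m in zip(token_logprobs, loss_mask) if m)

# positions:      x1     x2     y1     y2
logprobs = [-1.0, -2.0, -0.5, -0.25]
mask     = [0,    0,    1,    1]      # zero out the loss on input tokens x
loss = selective_nll(logprobs, mask)  # -(-0.5 + -0.25) = 0.75 = -log Pr_θ(y|x)
```

In a framework such as PyTorch, the same effect is commonly achieved by setting the labels of input-token positions to the cross-entropy loss's ignore index so those positions are skipped.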

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
SFT as Language Model Training on Concatenated Sequences
Calculating Conditional Log-Probability Using an LLM
Selective Loss Computation in Joint Probability Language Modeling
Calculating Conditional Log-Probability
An engineer is evaluating a language model and calculates the following log-probabilities for an input sequence x and an output sequence y: the joint log-probability log Pr([x, y]) and the marginal log-probability log Pr(x). They observe that the value of log Pr([x, y]) is significantly more negative than the value of log Pr(x). Based on the fundamental relationship between joint, conditional, and marginal probabilities, what is the most accurate conclusion?

A language model is being evaluated. For a given input sequence x and a potential output sequence y, the model calculates log Pr([x, y]) = -3.5 and log Pr(x) = -5.2. Based on these values, it is reasonable to conclude that the model's probability calculations are functioning correctly.
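The arithmetic behind that scenario can be checked directly with the chain rule; this short sketch uses the numbers from the question above (the variable names are illustrative):

```python
# By the chain rule: log Pr(y|x) = log Pr([x, y]) - log Pr(x)
log_joint = -3.5
log_marginal = -5.2
log_conditional = log_joint - log_marginal  # 1.7, i.e. Pr(y|x) > 1

# A valid probability satisfies Pr(y|x) <= 1, so log Pr(y|x) <= 0,
# which forces log Pr([x, y]) <= log Pr(x). These values violate that.
valid = log_conditional <= 0
```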
Learn After
A language model is being trained on instruction-following data. For one specific training instance, the model processes the full tokenized sequence: ['User:', 'What', 'is', '2+2?', 'Assistant:', '4']. The goal is to train the model to provide the correct response ('4') when given the user's prompt. During the backpropagation step for this single instance, on which token(s) is the predictive loss calculated to update the model's weights?

Diagnosing a Faulty Language Model Training Process
A machine learning engineer is training a language model for a question-answering task. The training data consists of concatenated [question, answer] sequences. Due to a configuration error, the training loss is calculated across all tokens in the sequence (both question and answer), instead of only on the answer tokens. What is the most likely and significant negative consequence of this misconfiguration on the model's behavior?

Loss Masking via Forward and Backward Passes in SFT
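As a concrete illustration of how a loss mask might be built for an instruction-following sequence like the one above (the helper name and the 'Assistant:' marker convention are assumptions for this sketch, not from the source):

```python
def build_loss_mask(tokens, response_marker='Assistant:'):
    """Return a per-token mask: 1 for tokens after the response marker
    (the answer the model should learn), 0 for all prompt tokens."""
    start = tokens.index(response_marker) + 1
    return [0] * start + [1] * (len(tokens) - start)

tokens = ['User:', 'What', 'is', '2+2?', 'Assistant:', '4']
mask = build_loss_mask(tokens)
# Only positions with mask == 1 contribute to the loss, so gradients
# during the backward pass flow only from predicting the response tokens.
```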