Total Loss Calculation for a Token Sequence
The total loss for a given sequence of tokens $(x_0, x_1, \dots, x_m)$ is computed by summing the individual losses over each position $i$ from $1$ to $m$. At each position $i$, a loss function $L$ measures the discrepancy between the model's predicted probability distribution for the next token ($p_i$) and the ground-truth distribution ($p_i^{\text{gold}}$). This is expressed generally as:

$$\mathrm{Loss} = \sum_{i=1}^{m} L\big(p_i, p_i^{\text{gold}}\big)$$
In natural language processing, this loss function is typically the cross-entropy loss. Because the ground-truth distribution is a one-hot vector over the vocabulary, the cross-entropy at each step reduces to the negative log-probability assigned to the correct token, leading to the specific formula:

$$\mathrm{Loss} = -\sum_{i=1}^{m} \log \Pr_{\theta}\big(x_i \mid x_0, \dots, x_{i-1}\big)$$
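The summation above can be sketched in a few lines of Python. The toy vocabulary, distributions, and probability values below are invented for illustration; since the ground truth is one-hot, each per-position loss is simply minus the log of the probability the model gave to the correct next token.

```python
import math

# Illustrative sketch of the total-loss computation described above.
# The vocabulary and probability values are made up for this example.
def cross_entropy_total_loss(predictions, targets):
    """Sum -log p_i(x_i) over positions: each p_i is the model's
    predicted next-token distribution at step i, and x_i is the
    ground-truth next token at that step."""
    return sum(-math.log(dist[tok]) for dist, tok in zip(predictions, targets))

# Predicted next-token distributions after "The" and after "The cat".
preds = [
    {"The": 0.1, "cat": 0.6, "sat": 0.1, "on": 0.1, "mat": 0.1},
    {"The": 0.05, "cat": 0.05, "sat": 0.7, "on": 0.1, "mat": 0.1},
]
targets = ["cat", "sat"]  # actual next tokens in the training text

total = cross_entropy_total_loss(preds, targets)
print(round(total, 4))  # -(ln 0.6 + ln 0.7) ≈ 0.8675
```

A confident prediction (0.6 for "cat") contributes a small loss, while a low probability on the correct token would dominate the sum, which is why one bad prediction can swamp the sequence-level loss.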
References
Reference of Foundations of Large Language Models Course
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.3 Prompting - Foundations of Large Language Models
Related
Total Loss Calculation for a Token Sequence
An auto-regressive language model is being trained on the text sequence: 'The quick brown fox jumps'. At the training step where the model has processed the input 'The quick brown fox', what two quantities are compared by the cross-entropy loss function to calculate the error signal for updating the model's parameters?
Language Model Training Step Analysis
An auto-regressive language model is being trained on a large text corpus. At one training step, the model processes the input 'The cat sat on the' and must predict the next token. The actual next token in the training data is 'mat'. Which of the following predicted probability distributions for the next token would result in the lowest cross-entropy loss?
Log-Likelihood Objective for Language Model Training
Formulating the MLE Objective for a Small Dataset
Total Loss Calculation for a Token Sequence
A model is being trained on a dataset containing just two sequences:
seq_1 = (x_0, x_1) and seq_2 = (y_0, y_1, y_2). According to the principle of maximum likelihood estimation for sequential data, which expression correctly represents the decomposed log-probability that the model aims to maximize for this entire dataset?
When training a model on a sequence of data using the Maximum Likelihood Estimation objective, a single prediction with a very low conditional probability for one element in the sequence can have a disproportionately large negative impact on the total log-probability calculated for that entire sequence.
Pre-trained Language Model Decoder Inference
Loss Function for RNN
Sample-wise Negative Log-Likelihood Loss for a Sub-sequence
Cross-Entropy Loss for Knowledge Distillation
A language model is being trained to generate the four-word sentence 'The quick brown fox'. The model generates one word at a time, and the error (loss) is calculated at each step:
- Loss for 'The' = 0.1
- Loss for 'quick' = 0.3
- Loss for 'brown' = 0.2
- Loss for 'fox' = 0.4
To update the model's parameters, the training process computes a single, overall loss value for the entire sentence. Which statement best analyzes this method of calculating the overall loss?
Total Loss Calculation for a Token Sequence
Calculating Average Sequence-Level Loss
Evaluating Training Strategies for a Translation Model
A language model is being trained to predict the next word in a sequence. The training process aims to minimize a loss value, which measures the difference between the model's predicted probability distribution for the next word and the actual correct word. Consider two separate predictions for the next word after the phrase 'The sun is shining...':
- Prediction A: The model assigns a probability of 0.75 to the correct word, 'brightly'.
- Prediction B: The model assigns a probability of 0.15 to the correct word, 'brightly'.
Which of the following statements accurately analyzes the loss values for these two predictions?
Total Loss Calculation for a Token Sequence
Evaluating Model Prediction Quality
Defining the Ground Truth Distribution
Learn After
Pre-training Objective for Language Models
Example of a Token Sequence
Example of an Indexed Token Sequence
A language model is evaluated on a sequence of four tokens,
(x_0, x_1, x_2, x_3). The model's performance is measured by calculating a loss value at each step of the sequence generation. The individual losses are as follows: the loss for predicting token x_1 is 1.2, the loss for predicting x_2 is 0.5, and the loss for predicting x_3 is 2.3. Based on this information, what is the total loss for the entire token sequence?
Comparative Model Performance Analysis
A language model's performance is being evaluated on the token sequence
('The', 'cat', 'sat', 'on'). The total loss for this sequence is calculated by summing the individual losses from each predictive step. Which of the following sets of predictions contributes to this total loss calculation?
Ground-Truth Distribution as a One-Hot Representation