Learn Before
Loss Function for Language Modeling
To train a language model, such as a decoder-only architecture, the standard approach is to minimize a loss function over a collection of token sequences. This function measures the discrepancy between the model's predicted probability distribution and the true, gold-standard distribution at each position. In natural language processing, this discrepancy is typically quantified using the log-scale cross-entropy loss.
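As a minimal sketch of this objective (assuming natural logarithms, a one-hot gold distribution at each position, and a hypothetical helper name `sequence_cross_entropy`), the per-sequence loss can be computed as the average negative log-probability the model assigns to each gold token:

```python
import math

def sequence_cross_entropy(pred_dists, gold_tokens):
    """Average negative log-likelihood of the gold token at each position.

    pred_dists: list of dicts mapping candidate token -> predicted probability
    gold_tokens: the correct (gold-standard) token at each position
    """
    total = 0.0
    for dist, gold in zip(pred_dists, gold_tokens):
        # Cross-entropy against a one-hot target reduces to -log p(gold token)
        total += -math.log(dist[gold])
    return total / len(gold_tokens)

# Toy two-position sequence (illustrative numbers, not from the text)
preds = [
    {"cat": 0.7, "dog": 0.3},
    {"sat": 0.9, "ran": 0.1},
]
loss = sequence_cross_entropy(preds, ["cat", "sat"])
```

Minimizing this quantity pushes the model to place more probability mass on the correct token at every position.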

Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Fundamental LLM Training Objective
LLM Policy as a Probability Distribution
A language model is given the context: 'The chef carefully added the final, crucial ingredient to the simmering stew: a pinch of...'. The model must predict the next word. Below are the conditional probabilities, Pr(next_word | context), calculated by two different models for four possible next words.

Next Word    Model A Probability    Model B Probability
salt         0.65                   0.20
concrete     0.02                   0.45
laughter     0.03                   0.15
thyme        0.30                   0.20

Based on this data, which of the following statements is the most accurate analysis of the models' understanding of the context?
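To connect this question to the loss function above: if a sensible word such as 'salt' were the gold next token, each model's per-token cross-entropy loss would simply be the negative log of the probability it assigned to that word. A small sketch (assuming natural logarithms; the probabilities come from the table):

```python
import math

# Pr(salt | context) for each model, taken from the table above
probs = {"Model A": 0.65, "Model B": 0.20}

# Per-token cross-entropy loss if 'salt' is the gold next word
losses = {model: -math.log(p) for model, p in probs.items()}
```

The model that assigns more probability to the plausible continuation incurs the smaller loss.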
Mathematical Notation for Text Generation Probability
Evaluating Language Model Suitability
Predicting Next-Word Likelihood
Loss Function for Language Modeling
A Broad Definition of Cross Entropy
Why do we want to minimize cross-entropy loss?
Denoising Autoencoder Training Objective
MLM Training Objective using Cross-Entropy Loss
Consider a binary classification task where the correct label for a specific instance is 1. A model makes two different predictions for this instance: Prediction A is 0.9 and Prediction B is 0.6. According to the cross-entropy loss function, which statement accurately compares the loss for these two predictions?
Calculating Cross-Entropy Loss
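The comparison in this question can be checked numerically. A minimal sketch of binary cross-entropy (assuming natural logarithms and a hypothetical helper name `binary_cross_entropy`):

```python
import math

def binary_cross_entropy(y, p):
    """Cross-entropy loss for one binary instance with true label y and predicted p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

loss_a = binary_cross_entropy(1, 0.9)  # more confident in the correct label
loss_b = binary_cross_entropy(1, 0.6)  # less confident in the correct label
```

Because the true label is 1, each loss reduces to -log(p), so the more confident correct prediction receives the smaller loss.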
Analyzing Model Errors with Cross-Entropy Loss
Loss Function for Language Modeling
Learn After
A language model is being trained to predict the next word in a sequence. The training process aims to minimize a loss value, which measures the difference between the model's predicted probability distribution for the next word and the actual correct word. Consider two separate predictions for the next word after the phrase 'The sun is shining...':
- Prediction A: The model assigns a probability of 0.75 to the correct word, 'brightly'.
- Prediction B: The model assigns a probability of 0.15 to the correct word, 'brightly'.
Which of the following statements accurately analyzes the loss values for these two predictions?
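A short sketch of how the two loss values could be computed (assuming natural logarithms and a one-hot target, so the loss is just the negative log of the probability assigned to the correct word 'brightly'):

```python
import math

loss_a = -math.log(0.75)  # Prediction A: p('brightly') = 0.75
loss_b = -math.log(0.15)  # Prediction B: p('brightly') = 0.15
```

The prediction that assigns higher probability to the correct word yields the smaller loss, and because the log is steep near zero, the low-probability prediction is penalized disproportionately.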
Total Loss Calculation for a Token Sequence
Evaluating Model Prediction Quality
Defining the Ground Truth Distribution