Learn Before
Evaluating Model Performance via Cross-Entropy Loss
A language model is being trained to predict a masked token. For a specific training instance, the correct token is 'river'. Two different models, Model A and Model B, produce the probability distributions shown below for the masked position. Based on the goal of minimizing cross-entropy loss, which model is performing better on this specific instance? Justify your answer by explaining how the loss is calculated in this scenario.
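As a minimal sketch of the calculation the question asks about: with a one-hot target, the cross-entropy loss at a single masked position reduces to the negative log of the probability the model assigns to the correct token. The two distributions below are hypothetical placeholders, not the ones from this exercise, and the helper name masked_token_loss is an assumption made for illustration.

```python
import math

def masked_token_loss(distribution, correct_token):
    """Cross-entropy loss for one masked position with a one-hot target.

    Only the probability assigned to the correct token contributes,
    so the loss is -log p(correct_token).
    """
    return -math.log(distribution[correct_token])

# Hypothetical distributions (assumed values, NOT those of the exercise).
model_a = {"river": 0.6, "bank": 0.3, "road": 0.1}
model_b = {"river": 0.2, "bank": 0.7, "road": 0.1}

loss_a = masked_token_loss(model_a, "river")
loss_b = masked_token_loss(model_b, "river")

# The model that puts more probability on the correct token has the lower loss.
better = "Model A" if loss_a < loss_b else "Model B"
print(f"loss A = {loss_a:.3f}, loss B = {loss_b:.3f}, better: {better}")
```

Under these assumed values, the model that assigns the higher probability to 'river' gets the lower loss; how the remaining probability mass is spread over incorrect tokens does not change the result.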
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Probability of a True Token in MLM
Predicted Probability Distribution in MLM
Example of MLM Training Objective with Multiple Masks
MLM Loss Function as Negative Log-Likelihood
A language model is being trained to fill in a masked word. For the input 'The cat sat on the [MASK]', the correct word is 'mat'. The training objective is to adjust the model to minimize the cross-entropy loss for its predictions. Below are four different potential outputs from the model, showing the probability it assigns to the word 'mat'. Which of these outputs would result in the LOWEST loss for this specific training example?
Evaluating Model Performance via Cross-Entropy Loss
According to the standard Masked Language Modeling (MLM) training objective, a model's parameters are adjusted based on the cross-entropy loss calculated for a single, strategically chosen masked token within a training batch, aiming to optimize performance on that specific prediction.