Learn Before
Negative Log-Likelihood Loss for NER
The training loss for a Named Entity Recognition (NER) model is commonly defined as the average negative log-likelihood of the correct tags. This loss function aims to maximize the probability assigned to the ground-truth tag for each token in a sequence. The formula is given by:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log p_{t_i}$$

Where:
- $N$ is the total number of tokens in the sequence.
- $p_{t_i}$ is the probability that the model predicts for the correct tag, $t_i$, at position $i$.
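As a concrete illustration, the average negative log-likelihood can be computed directly from the per-token probabilities of the correct tags. This is a minimal sketch, assuming those probabilities have already been extracted from the model's output distributions:

```python
import math

def nll_loss(correct_tag_probs):
    """Average negative log-likelihood over a sequence.

    correct_tag_probs: the probability the model assigned to the
    ground-truth tag at each token position.
    """
    n = len(correct_tag_probs)
    return -sum(math.log(p) for p in correct_tag_probs) / n

# Confident, correct predictions yield a loss near zero.
print(round(nll_loss([0.9, 0.8, 0.95]), 4))  # ≈ 0.1266
```

Note that a probability of exactly 1.0 at every position gives a loss of 0, and the loss grows without bound as any correct-tag probability approaches 0.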

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.1 Pre-training - Foundations of Large Language Models
Related
Negative Log-Likelihood Loss for NER
A model for Named Entity Recognition is being trained. During one step, it processes a sentence and produces the probability distributions below for two of the words. The training process aims to adjust the model's parameters by calculating a loss based on the predicted probability of the correct, ground-truth tag for each word.
Word: 'Anya' (Ground-truth tag: I-PER)
- B-PER: 0.05
- I-PER: 0.85
- O: 0.10

Word: 'Berlin' (Ground-truth tag: B-LOC)
- B-LOC: 0.10
- B-ORG: 0.45
- O: 0.45
Based on this information, which word's prediction will contribute a larger value to the overall training loss for this step, and why?
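The comparison can be checked numerically: each word contributes $-\log p$ to the loss, where $p$ is the probability assigned to its ground-truth tag. A quick sketch using the natural logarithm:

```python
import math

# Per-word loss contribution is -log(p), where p is the probability
# the model assigned to the ground-truth tag for that word.
anya_contribution = -math.log(0.85)    # ground truth I-PER received 0.85
berlin_contribution = -math.log(0.10)  # ground truth B-LOC received only 0.10

print(f"Anya:   {anya_contribution:.3f}")    # ≈ 0.163
print(f"Berlin: {berlin_contribution:.3f}")  # ≈ 2.303
```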
Model Parameter Adjustment during Training
Consider a model being trained to assign a category tag (e.g., 'Person', 'Location', 'Other') to each word in a sentence. If, for a specific word, the model's output assigns a very high probability (e.g., 0.98) to the correct, ground-truth tag, that word's loss contribution, $-\log(0.98) \approx 0.02$, is close to zero, so the training process will make only a small adjustment to the model's parameters based on this specific word's prediction.
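The loss contribution, and with it the size of the gradient-driven parameter adjustment, shrinks as the correct-tag probability approaches 1. A quick check with the natural logarithm:

```python
import math

# Loss contribution -log(p) for different correct-tag probabilities:
# near-certain predictions barely move the parameters, while
# low-probability mistakes dominate the update.
for p in (0.98, 0.50, 0.10):
    print(f"p = {p:.2f} -> loss = {-math.log(p):.3f}")
# p = 0.98 -> loss = 0.020
# p = 0.50 -> loss = 0.693
# p = 0.10 -> loss = 2.303
```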
Learn After
Calculating Model Training Loss
A model is being trained for a text labeling task where the goal is to maximize the probability assigned to the correct label for each word. The training loss is calculated as the average of the negative logarithm of these probabilities. Consider the model's performance on one sentence, evaluated by two different sets of parameters (Model A and Model B). The table below shows the probability each model assigned to the correct label for each of the seven words in the sentence.
| Word | Model A Probability | Model B Probability |
| --- | --- | --- |
| Word 1 | 0.9 | 0.8 |
| Word 2 | 0.8 | 0.6 |
| Word 3 | 0.7 | 0.6 |
| Word 4 | 0.9 | 0.8 |
| Word 5 | 0.9 | 0.8 |
| Word 6 | 0.1 | 0.7 |
| Word 7 | 0.9 | 0.8 |

Based on this data, which model would have a lower training loss for this specific sentence, and why?
Impact of Model Confidence on Training Loss