Learn Before
Training BERT-based NER Models
For Named Entity Recognition (NER) tasks using a BERT-based model, the model outputs a probability distribution, denoted p_i, over the set of possible tags for the token at each position i. The training or fine-tuning process optimizes the model's parameters by using these distributions. A common training loss is the negative log-likelihood, Loss = -Σ_i log p_i(y_i), which is calculated based on p_i(y_i), the model's predicted probability of the correct tag y_i at each position i.
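As a concrete sketch of this loss, the helper below averages -log p_i(y_i) over a tagged sequence. The function name and the toy probabilities are illustrative only, not output from a real model.

```python
import math

def nll_loss(pred_dists, gold_tags):
    """Average negative log-likelihood over a tagged sequence.

    pred_dists: list of dicts mapping tag -> predicted probability,
                one dict per token position.
    gold_tags:  list of ground-truth tags, one per position.
    """
    total = 0.0
    for dist, gold in zip(pred_dists, gold_tags):
        # The loss at each position is -log of the probability
        # the model assigned to the correct tag.
        total += -math.log(dist[gold])
    return total / len(gold_tags)

# Toy two-token example with made-up distributions.
dists = [
    {"B-PER": 0.05, "I-PER": 0.85, "O": 0.10},
    {"B-LOC": 0.10, "B-ORG": 0.45, "O": 0.45},
]
loss = nll_loss(dists, ["I-PER", "B-LOC"])
```

A confidently correct prediction (p close to 1) contributes a loss near zero, while a low probability on the correct tag contributes a large loss.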

Ch.2 Generative Models - Foundations of Large Language Models
Related
Illustration of BERT-based Architecture for Named Entity Recognition
Training BERT-based NER Models
BERT-based Architecture for Span Prediction
An engineer is using a pre-trained transformer model to build a system that assigns a grammatical tag (e.g., Noun, Verb, Adjective) to every word in a sentence. After the model processes the input and generates a final hidden state vector for each token, which of the following is the most appropriate architectural choice to generate the tag for each specific word?
A developer is building a model to assign a specific category (e.g., 'Person', 'Location', 'Organization') to each word in a sentence. The model's architecture involves using a large, pre-trained component to understand the context of each word. Arrange the following steps in the correct chronological order that describes how this model processes an input sentence to generate a label for each word.
An engineer is building a system to identify and tag specific medical terms (e.g., 'symptom', 'disease', 'medication') within clinical notes. They are using a large, pre-trained transformer-based model that processes an entire sentence and outputs a contextualized vector representation for each input token. Which of the following describes the most effective and standard final layer design for this token-level classification task?
Learn After
Negative Log-Likelihood Loss for NER
A model for Named Entity Recognition is being trained. During one step, it processes a sentence and produces the probability distributions below for two of the words. The training process aims to adjust the model's parameters by calculating a loss based on the predicted probability of the correct, ground-truth tag for each word.
Word: 'Anya' (ground-truth tag: I-PER)
  B-PER: 0.05, I-PER: 0.85, O: 0.10
Word: 'Berlin' (ground-truth tag: B-LOC)
  B-LOC: 0.10, B-ORG: 0.45, O: 0.45
Based on this information, which word's prediction will contribute a larger value to the overall training loss for this step, and why?
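The comparison can be worked out directly from the per-token negative log-likelihood described above, using the probabilities each word's distribution assigns to its correct tag:

```python
import math

# Per-token loss is -log of the probability assigned to the correct tag.
loss_anya = -math.log(0.85)    # 'Anya', correct tag I-PER got p = 0.85
loss_berlin = -math.log(0.10)  # 'Berlin', correct tag B-LOC got p = 0.10

# 'Berlin' contributes the larger loss, because the model assigned
# a much lower probability (0.10 vs 0.85) to its correct tag.
```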
Model Parameter Adjustment during Training
Consider a model being trained to assign a category tag (e.g., 'Person', 'Location', 'Other') to each word in a sentence. If, for a specific word, the model's output assigns a very high probability (e.g., 0.98) to the correct, ground-truth tag, the training process will make a large adjustment to the model's parameters based on this specific word's prediction.
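The claim can be checked numerically. Under the negative log-likelihood loss described earlier, a token whose correct tag already receives p = 0.98 contributes almost nothing to the loss:

```python
import math

# Loss contribution from a confidently correct prediction.
confident_correct = -math.log(0.98)  # close to zero

# Because the loss from this token is near zero, the gradient it
# produces is small: training makes only a small adjustment for
# tokens the model already classifies correctly with high confidence.
```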