Learn Before
BERT Loss Function
The total training loss for a standard BERT model is calculated by summing the individual losses from its two pre-training tasks: masked language modeling (MLM) and next sentence prediction (NSP). The formula is expressed as: .

0
1
References
Reference of Foundations of Large Language Models Course
Reference of Foundations of Large Language Models Course
Reference of Foundations of Large Language Models Course
Reference of Foundations of Large Language Models Course
Reference of Foundations of Large Language Models Course
Dive into Deep Learning
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
D2L
Dive into Deep Learning @ D2L
Related
BERT Loss Function
Concurrent Loss Calculation for MLM and NSP
A researcher is pre-training a large language model using a dual-task objective. The model is simultaneously trained on two tasks:
- Predicting randomly obscured words within a given text.
- Determining if two text segments presented together originally appeared consecutively. The final training update is based on the model's combined performance on both tasks. Which of the following statements best analyzes the primary advantage of this specific dual-task approach?
Evaluating a Modified Pre-training Strategy
The original pre-training process for the Bidirectional Encoder Representations from Transformers model involves a dual-task objective where the total loss is the sum of the losses from two distinct tasks. Match each training task to its corresponding description.
Learn After
BERT Training Process
An engineer is pre-training a language model that simultaneously learns to predict masked words in a sentence and to determine if two sentences are consecutive. In a single training step, the loss for the masked word prediction task is calculated as 1.8, and the loss for the sentence relationship task is 0.6. What is the total loss value that will be used to update the model's parameters for this step?
Analyzing Language Model Training Loss
Analyzing Dual-Task Model Training Performance