BERT Loss Function
The total training loss for the BERT model is calculated by summing the individual losses from its two pre-training objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). The formula is expressed as: L_total = L_MLM + L_NSP.
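To make the summation concrete, here is a minimal stdlib-only sketch of how the two per-task cross-entropy losses combine into a single training loss. The logits, vocabulary size, and target indices below are toy values chosen for illustration, not outputs of a real BERT model.

```python
import math

def cross_entropy(logits, target):
    # Numerically stable softmax followed by negative log-likelihood
    # for a single prediction.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return -math.log(exps[target] / z)

# MLM head: scores over a tiny hypothetical 4-word vocabulary
# for one masked position.
mlm_logits = [2.0, 0.5, 0.1, -1.0]
mlm_target = 0  # index of the original (masked-out) token

# NSP head: scores over the two classes {IsNext, NotNext}.
nsp_logits = [1.5, -0.5]
nsp_target = 0  # the two segments really were consecutive

loss_mlm = cross_entropy(mlm_logits, mlm_target)
loss_nsp = cross_entropy(nsp_logits, nsp_target)

# BERT's total pre-training loss is the plain sum of the two.
total_loss = loss_mlm + loss_nsp
print(round(total_loss, 4))
```

In practice each loss is averaged over many masked positions and sentence pairs in a batch, but the combination step is exactly this unweighted sum.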

Tags
Ch.1 Pre-training - Foundations of Large Language Models
Related
Concurrent Loss Calculation for MLM and NSP
A researcher is pre-training a large language model using a dual-task objective. The model is simultaneously trained on two tasks:
- Predicting randomly obscured words within a given text.
- Determining whether two text segments presented together originally appeared consecutively.

The final training update is based on the model's combined performance on both tasks. Which of the following statements best analyzes the primary advantage of this dual-task approach?
Evaluating a Modified Pre-training Strategy
The original pre-training process for the Bidirectional Encoder Representations from Transformers model involves a dual-task objective where the total loss is the sum of the losses from two distinct tasks. Match each training task to its corresponding description.