Comparison of Masked vs. Causal Language Modeling
Causal Language Modeling (CLM), also known as conventional language modeling, can be understood as a special case of Masked Language Modeling (MLM): when predicting the token at a given position, every token in the right-hand context is treated as masked, so the model must rely exclusively on the preceding left-hand context. This makes CLM a unidirectional process. General MLM, by contrast, is bidirectional: it predicts a masked token using all unmasked tokens in the sequence, drawn from both the left and right contexts.
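To make the contrast concrete, here is a minimal sketch in plain NumPy of the context-visibility pattern each objective implies. The sequence length and the choice of masked position are illustrative assumptions, not values from the course; the point is only that CLM's "mask everything to the right" rule yields a lower-triangular visibility matrix, while general MLM keeps full bidirectional visibility and instead replaces the predicted token in the input.

```python
import numpy as np

seq_len = 5  # illustrative size, not from the source

# CLM visibility: position i may only see positions j <= i.
# Masking the entire right-hand context at every position produces
# a lower-triangular matrix, i.e. strictly left-to-right prediction.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# General MLM visibility: attention stays fully bidirectional.
# The token(s) to be predicted are replaced by [MASK] in the *input*
# rather than hidden from the attention pattern.
masked_positions = [3]  # hypothetical choice of masked token
mlm_visibility = np.ones((seq_len, seq_len), dtype=bool)

print("CLM mask (True = visible to the predicting position):")
print(causal_mask)
print("MLM: all positions visible; input token(s) at",
      masked_positions, "replaced by [MASK] before encoding.")
```

Viewed this way, CLM is simply the MLM visibility matrix with everything above the diagonal forced to False, which is why the paragraph above describes it as a specific instance of MLM.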
References
Reference of Foundations of Large Language Models Course
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Formal Definition of the Masking Process in MLM
Example of Masked Language Modeling with Single and Multiple Masks
Training Objective of Masked Language Modeling (MLM)
Drawback of Masked Language Modeling: The [MASK] Token Discrepancy
Limitation of MLM: Ignoring Dependencies Between Masked Tokens
The Generator in Replaced Token Detection
Consecutive Token Masking in MLM
Token Selection and Modification Strategy in BERT's MLM
BERT's Masked Language Modeling Pre-training Pipeline
Performance Degradation and Early Stopping in Pre-training
Flexibility of Masked Language Modeling for Encoder-Decoder Training
Training Objective of the Standard BERT Model
During a self-supervised pre-training process, a model is given an input sequence where one word has been replaced by a special symbol, for example: 'The quick brown [MASK] jumps over the lazy dog.' The model's objective is to predict the original word, 'fox'. Which of the following is the direct input used by the final output layer to make this specific prediction?
Original Sequence for Masking and Deletion Examples
Arrange the following steps in the correct order to describe the process of pre-training an encoder model using a masked language modeling objective.
Evaluating a Pre-training Strategy for a Specific Application
Learn After
A language model is being developed specifically for a task that involves generating long, coherent passages of text, such as writing a story from an initial prompt. The model must generate the text sequentially, predicting each new word based only on the words that came before it. Which training approach is inherently structured for this type of task, and what is the key reason?
Identifying Language Modeling Approach
Match each characteristic to the language modeling approach it describes. The two approaches are 'Causal Language Modeling' and 'General Masked Language Modeling'.