Causal Language Modeling as a Special Case of Masked Language Modeling
Conventional Causal Language Modeling (CLM) can be conceptualized as a specific instance of Masked Language Modeling (MLM). In this view, for any given position in a sequence, the prediction task is equivalent to an MLM task where all tokens in the right-context are masked. The model is then trained to predict the token at the current position using only its available left-context.
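A minimal sketch of this framing in plain Python (no ML libraries; the sentence, function name, and [MASK] placeholder are illustrative assumptions): for a given position, hide the target token and its entire right-context, and the visible tokens are exactly the left-context a causal model would see.

```python
tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

def clm_as_mlm_input(tokens, t, mask_token="[MASK]"):
    # Visible tokens are exactly the CLM left-context; the target at position t
    # and its right-context are hidden behind the mask placeholder.
    return tokens[:t] + [mask_token] * (len(tokens) - t)

for t in range(1, 4):
    print(f"t={t}, target={tokens[t]!r}")
    print("  MLM-style input :", clm_as_mlm_input(tokens, t))
    print("  CLM left-context:", tokens[:t])
```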
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Example of Masked Language Modeling Prediction
Consider two different approaches for training a language model to predict a specific word within a sentence.
Approach 1: The model is trained to predict the next word in a sequence, using only the words that have appeared before it.
Approach 2: The model is trained to predict a word that has been intentionally hidden, using all the other visible words in the sentence, both those that come before and after the hidden word.
If both models are tasked with predicting the word 'jumps' in the sentence 'The quick brown fox jumps over the lazy dog', which statement correctly analyzes the contextual information available to each model for this specific task?
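As a concrete reference point for the question above, here is a small sketch (assuming simple whitespace tokenization of the example sentence) of the context each approach can condition on when predicting 'jumps':

```python
sentence = "The quick brown fox jumps over the lazy dog".split()
target_idx = sentence.index("jumps")   # position of the word to be predicted

# Approach 1 (next-word prediction): only the preceding words are visible.
approach_1_context = sentence[:target_idx]

# Approach 2 (masked-word prediction): every word except the hidden target is visible.
approach_2_context = sentence[:target_idx] + sentence[target_idx + 1:]

print("Approach 1 sees:", approach_1_context)
print("Approach 2 sees:", approach_2_context)
```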
Choosing the Right Contextual Approach for Language Tasks
Match each description of a language model's prediction task or characteristic to the type of contextual information it utilizes.
Learn After
Consider the task of predicting the token 'fox' in the sequence 'The quick brown fox jumps'. To make a bidirectional model's prediction for 'fox' equivalent to that of a unidirectional (left-to-right) model, which set of tokens must be masked (i.e., hidden) from the bidirectional model's view?
Adapting a Bidirectional Model for a Unidirectional Task
A language model trained exclusively for next-token prediction (i.e., predicting a word based only on the words that precede it) can be framed as a specific implementation of a masked language model where, for every prediction, all subsequent tokens in the sequence are systematically masked.
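An illustrative sketch of that systematic masking (plain Python; the sequence length and "show"/"mask" labels are assumptions for display only): for every prediction position, only strictly earlier positions remain visible, which is the pattern a causal attention mask enforces at all positions simultaneously.

```python
n = 5  # e.g. the 5-token sequence "The quick brown fox jumps"

# visible[i][j] is True when the prediction for position i may condition on position j;
# only strictly earlier positions (j < i) are visible, so every subsequent token is masked.
visible = [[j < i for j in range(n)] for i in range(n)]

for i, row in enumerate(visible):
    print(f"predicting position {i}:", ["show" if v else "mask" for v in row])
```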
Adapting a Bidirectional Model for Generative Tasks