Encoder Processing of a Corrupted Sequence in MLM
After a sequence is corrupted for Masked Language Modeling (MLM), such as [CLS] It is [MASK] . [SEP] I need [MASK] hat . [SEP], it is passed to the Transformer encoder for training. Each token in the modified sequence is first converted into an input embedding (e.g., e0, e1, ... e11). The encoder then processes this sequence of embeddings to produce a sequence of contextualized hidden states (e.g., h0, h1, ... h11). The model is then trained to use these hidden states to predict the original tokens at the altered positions (e.g., 'raining' and 'an' at the two [MASK] positions, and 'umbrella' at the position where a random token replaced it).
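The data flow above (corrupted tokens → input embeddings → contextualized hidden states → prediction at the altered positions) can be sketched in plain Python. This is a schematic only: the embeddings are random vectors, and `toy_encoder` is a hypothetical stand-in that merely mixes each position with the sequence mean to mimic contextualization; a real Transformer encoder uses self-attention and feed-forward layers.

```python
import random

random.seed(0)

# Hypothetical toy vocabulary; ids and vectors are illustrative only.
vocab = ["[CLS]", "[SEP]", "[MASK]", "It", "is", ".", "I", "need", "hat",
         "raining", "an", "umbrella"]

corrupted = ["[CLS]", "It", "is", "[MASK]", ".", "[SEP]",
             "I", "need", "[MASK]", "hat", ".", "[SEP]"]
targets = {3: "raining", 8: "an", 9: "umbrella"}  # positions the loss covers

dim = 8
# e0 .. e11: one input embedding per token in the corrupted sequence.
embed = {t: [random.gauss(0, 1) for _ in range(dim)] for t in vocab}
inputs = [embed[t] for t in corrupted]

def toy_encoder(xs):
    """Stand-in for the Transformer encoder: each output h_i mixes its own
    input with the mean of all inputs, so every hidden state depends on the
    whole sequence (contextualization)."""
    n = len(xs)
    mean = [sum(v[d] for v in xs) / n for d in range(dim)]
    return [[0.5 * v[d] + 0.5 * mean[d] for d in range(dim)] for v in xs]

hidden = toy_encoder(inputs)  # h0 .. h11, one hidden state per position

# Training would score the vocabulary at each target position from its
# hidden state and push up the score of the original token.
assert len(hidden) == len(corrupted)
```

Note that the loss is computed only at the chosen positions, but every hidden state is produced, so each prediction can draw on the full context.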
Tags: Ch.1 Pre-training - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences
Related
A language model's pre-training process involves corrupting input text. First, a subset of tokens (15%) is chosen for modification. Of these chosen tokens, 80% are replaced by a [MASK] token, 10% are replaced by a random token from the vocabulary, and 10% are left unchanged. The model is then trained to predict the original tokens for all chosen positions. Given the following transformation:
Original: [CLS] The artist painted a beautiful landscape . [SEP]
Corrupted: [CLS] The artist painted a beautiful [MASK] . [SEP]
If 'artist' and 'landscape' were the only two tokens chosen for modification, which statement provides the most accurate analysis of the corruption process?
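The 80/10/10 corruption rule described above can be sketched as a small function. This is a minimal illustration, not a production implementation; the function name, the per-token sampling of the 15% rate, and the example vocabulary are all assumptions for the sketch.

```python
import random

def corrupt_for_mlm(tokens, vocab, mask_rate=0.15, rng=random):
    """BERT-style corruption sketch: pick roughly 15% of non-special
    positions; of those, 80% become [MASK], 10% become a random vocabulary
    token, and 10% stay unchanged. Returns the corrupted sequence and the
    chosen positions whose original tokens must be predicted."""
    special = {"[CLS]", "[SEP]", "[MASK]"}
    corrupted = list(tokens)
    chosen = []
    for i, tok in enumerate(tokens):
        if tok in special or rng.random() >= mask_rate:
            continue
        chosen.append(i)
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"            # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: random token
        # else: 10%: left unchanged, but still predicted

    return corrupted, chosen

# Illustrative call on the example sequence (output varies with the seed).
seq = "[CLS] The artist painted a beautiful landscape . [SEP]".split()
corrupted, chosen = corrupt_for_mlm(seq, vocab=["hat", "sun", "book"],
                                    rng=random.Random(7))
```

The unchanged case is why 'artist' in the question can be a chosen token even though it looks untouched: the model must still predict it, which keeps the model from assuming every non-[MASK] token is correct.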
Evaluating a Pre-training Data Corruption Step
A text sequence is being prepared for a language model's training. The goal is to intentionally alter the sequence so the model can learn to predict the original words from the altered version. Arrange the following steps to correctly describe this data preparation pipeline.
Learn After
A language model is being trained using the following modified input sequence:
[CLS] The sun is very [MASK] today . [SEP]
This sequence is converted into input embeddings and passed through a multi-layer encoder. Which of the following statements most accurately describes the final hidden state vector that corresponds to the [MASK] token after it has been processed by the encoder?

A language model is being trained on the corrupted input sequence:
[CLS] The book was so [MASK] . [SEP]
Arrange the following steps in the correct chronological order, showing how the model processes this input to generate a representation suitable for predicting the masked word.

Diagnosing Contextualization Failure in Model Training