Self-Supervised Pre-training of Encoders via Masked Language Modeling
In the pre-training phase, an encoder model is trained with a self-supervised objective such as masked language modeling (MLM). The process begins by converting a corrupted input sequence, in which some tokens have been replaced by a special [MASK] symbol, into a sequence of embeddings. This embedding sequence is fed into the encoder, which produces a contextual vector representation for every input token. Finally, these representations are passed to an output layer, typically a Softmax over the vocabulary, which is trained to predict the original tokens at the masked positions.
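The sketch below illustrates this pipeline end to end: mask a few tokens, embed the corrupted sequence, run it through a Transformer encoder, and train a Softmax output layer to recover the masked tokens. It is a minimal toy example, not the exact setup of any particular model; the vocabulary size, hidden size, layer counts, 15% masking rate, and the `MASK_ID` convention are all illustrative assumptions.

```python
# Minimal masked-language-model pre-training sketch (toy sizes, assumed values).
import torch
import torch.nn as nn

VOCAB_SIZE = 1000   # assumed toy vocabulary size
HIDDEN     = 64     # assumed hidden size
MASK_ID    = 0      # assumed id reserved for the [MASK] token
SEQ_LEN    = 16
MASK_PROB  = 0.15   # common masking rate (e.g. in BERT); assumed here

class TinyMLMEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # token + position embeddings turn ids into the embedding sequence
        self.tok_emb = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.pos_emb = nn.Embedding(SEQ_LEN, HIDDEN)
        # a small Transformer encoder produces contextual vectors
        layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # output layer: logits over the vocabulary (Softmax is applied inside the loss)
        self.lm_head = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        h = self.tok_emb(ids) + self.pos_emb(pos)   # embeddings
        h = self.encoder(h)                         # contextual vectors per token
        return self.lm_head(h)                      # per-token vocabulary logits

model = TinyMLMEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a random toy batch.
original  = torch.randint(1, VOCAB_SIZE, (8, SEQ_LEN))   # "clean" sequences
mask      = torch.rand(original.shape) < MASK_PROB        # positions to corrupt
corrupted = original.masked_fill(mask, MASK_ID)           # replace with [MASK]

logits  = model(corrupted)
# The loss is computed only at masked positions; all others are ignored.
targets = original.masked_fill(~mask, -100)               # -100 = ignore index
loss = nn.functional.cross_entropy(
    logits.view(-1, VOCAB_SIZE), targets.view(-1), ignore_index=-100
)
loss.backward()
opt.step()
```

Note that the encoder output for a sequence of length 16 with hidden size 64 is a 16 x 64 matrix of contextual vectors; the linear head then maps each of those vectors to vocabulary logits.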
Tags
Deep Learning
Data Science
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Foundations of Large Language Models
Related
Auto-regressive Model in NLP
Autoencoding Model
Seq2Seq Model
Self-Supervised Pre-training of Encoders via Masked Language Modeling
Comparison of Self-Supervised Pre-training and Self-Training
Architectural Categories of Pre-trained Transformers
Self-Supervised Classification Tasks for Encoder Training
Prefix Language Modeling (PrefixLM)
Mask-Predict Framework
Discriminative Training
Learning World Knowledge from Unlabeled Data
Emergent Linguistic Capabilities from Pre-training
Architectural Approaches to Self-Supervised Pre-training
Self-Supervised Pre-training of Encoders via Masked Language Modeling
Word Prediction as a Core Self-Supervised Task
Learning World Knowledge from Unlabeled Data via Self-Supervision
A research team has a massive collection of unlabeled historical texts. Their goal is to pre-train a language model that understands the specific vocabulary and sentence structures within these documents, but they have no budget for manual data annotation. Which of the following approaches is the most effective and feasible for their pre-training task?
Analysis of Supervision Signal Generation
A team is developing a pre-training strategy for a new language model using a large corpus of unlabeled text. Which of the following proposed tasks best exemplifies the principles of self-supervised learning?
Prevalence of Self-Supervised Pre-training in NLP
Self-Supervised Pre-training of Encoders via Masked Language Modeling
Applying a Pre-trained Encoder to Downstream Tasks
BERT as an Illustrative Example of Pre-training and Application
A team is building a model to classify customer support emails into categories like 'Billing Inquiry', 'Technical Issue', or 'Feedback'. They have access to two datasets: 1) a massive, diverse collection of text from the internet, and 2) a curated set of 10,000 support emails, each correctly labeled with its category. Based on the standard two-stage training paradigm for this type of model, which statement best describes the distinct role and objective for each dataset?
A machine learning engineer is building a model to classify legal documents as 'Contract', 'Pleading', or 'Motion'. They are following the standard two-stage paradigm for this type of model. Arrange the following steps in the correct chronological order.
Diagnosing a Model Training Failure
A language model's encoder processes an input sequence consisting of 15 tokens. The model is configured with a hidden size of 768. What will be the dimensions of the final sequence of contextualized vectors produced by this encoder?
Self-Supervised Pre-training of Encoders via Masked Language Modeling
Applying a Pre-trained Encoder to Downstream Tasks
Arrange the following steps, which describe how a standard Transformer encoder processes a sequence of tokens, into the correct chronological order.
Interpreting a Transformer Encoder's Output
Learn After
Comparison of Masked vs. Causal Language Modeling
Formal Definition of the Masking Process in MLM
Example of Masked Language Modeling with Single and Multiple Masks
Training Objective of Masked Language Modeling (MLM)
Drawback of Masked Language Modeling: The [MASK] Token Discrepancy
Limitation of MLM: Ignoring Dependencies Between Masked Tokens
The Generator in Replaced Token Detection
Consecutive Token Masking in MLM
Token Selection and Modification Strategy in BERT's MLM
BERT's Masked Language Modeling Pre-training Pipeline
Performance Degradation and Early Stopping in Pre-training
Flexibility of Masked Language Modeling for Encoder-Decoder Training
Training Objective of the Standard BERT Model
During a self-supervised pre-training process, a model is given an input sequence where one word has been replaced by a special symbol, for example: 'The quick brown [MASK] jumps over the lazy dog.' The model's objective is to predict the original word, 'fox'. Which of the following is the direct input used by the final output layer to make this specific prediction?
Original Sequence for Masking and Deletion Examples
Arrange the following steps in the correct order to describe the process of pre-training an encoder model using a masked language modeling objective.
Evaluating a Pre-training Strategy for a Specific Application