Learn Before
Word Prediction as a Core Self-Supervised Task
A foundational self-supervised learning strategy behind prominent NLP models such as BERT and GPT is word prediction. The model is trained on a vast unlabeled text corpus to predict words hidden from it: BERT-style encoders predict masked words from their surrounding context, while GPT-style decoders predict each next word from the words that precede it. Because the missing word itself serves as the training label, this simple objective lets the model develop a general capacity for language understanding and generation without any explicit, human-provided labels.
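The defining property of both objectives is that the supervision signal is generated from the raw text itself. The following minimal Python sketch (the function names and toy sentence are illustrative, not from any specific library) shows how both kinds of training pairs can be produced from an unlabeled sentence with no annotation:

```python
import random

def masked_lm_example(tokens, mask_token="[MASK]"):
    """BERT-style objective: hide one word; the hidden word becomes the label."""
    i = random.randrange(len(tokens))
    masked = tokens[:i] + [mask_token] + tokens[i + 1:]
    return masked, tokens[i]

def next_word_examples(tokens):
    """GPT-style objective: every prefix predicts the word that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

sentence = "the model learns language from raw text".split()
print(masked_lm_example(sentence))       # e.g. (['the', '[MASK]', ...], 'model')
print(next_word_examples(sentence)[:2])  # [(['the'], 'model'), (['the', 'model'], 'learns')]
```

In a real system these (input, label) pairs would feed a neural network trained with cross-entropy loss, but the sketch captures the core idea: every sentence in the corpus yields training examples for free.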
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Comparison of Self-Supervised Pre-training and Self-Training
Architectural Categories of Pre-trained Transformers
Self-Supervised Classification Tasks for Encoder Training
Prefix Language Modeling (PrefixLM)
Mask-Predict Framework
Discriminative Training
Learning World Knowledge from Unlabeled Data
Emergent Linguistic Capabilities from Pre-training
Architectural Approaches to Self-Supervised Pre-training
Self-Supervised Pre-training of Encoders via Masked Language Modeling
Learning World Knowledge from Unlabeled Data via Self-Supervision
A research team has a massive collection of unlabeled historical texts. Their goal is to pre-train a language model that understands the specific vocabulary and sentence structures within these documents, but they have no budget for manual data annotation. Which of the following approaches is the most effective and feasible for their pre-training task?
Analysis of Supervision Signal Generation
A team is developing a pre-training strategy for a new language model using a large corpus of unlabeled text. Which of the following proposed tasks best exemplifies the principles of self-supervised learning?
Prevalence of Self-Supervised Pre-training in NLP
Learn After
An AI team trains a language model on a massive dataset of books and articles. The training process consists of a single, repeated task: the model is presented with a sentence where one word has been removed, and its goal is to predict the original missing word. The model is not given any other information or explicit rules about grammar or meaning. Based on this training method alone, what fundamental understanding is the model most likely developing?
Comparing Word Prediction Strategies
From Prediction to Understanding