Learn Before
Analyzing Text Corruption Strategies
Consider the following three methods for altering an input text sequence during the pre-training of a language model:
- Randomly replacing 15% of the words in the sequence with a special placeholder symbol.
- Randomly changing the order of sentences within the sequence.
- Deleting 15% of the words at random positions throughout the sequence.
Analyze these methods and identify which one is uniquely applicable to texts composed of multiple sentences. Justify your choice: explain why a multi-sentence structure is essential for that method, and why the other two methods do not share this requirement. (A minimal code sketch of all three methods appears after this prompt.)
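The following is a minimal Python sketch of the three corruption methods, under simplifying assumptions: text is split on whitespace rather than subword-tokenized, sentences are split naively on periods, and the `[MASK]` placeholder and function names are illustrative choices, not taken from any particular library.

```python
import random

# Hypothetical placeholder symbol; real models use tokenizer-specific tokens
# (e.g., BART uses <mask>). This exact string is an assumption for illustration.
MASK = "[MASK]"

def mask_words(words, rate=0.15, rng=random):
    """Replace ~`rate` of the words with a placeholder (word-level masking).
    Works on any flat sequence of words; no sentence structure is needed."""
    out = list(words)
    k = max(1, int(len(out) * rate))
    for i in rng.sample(range(len(out)), k):
        out[i] = MASK
    return out

def permute_sentences(sentences, rng=random):
    """Shuffle the order of sentences (sentence permutation).
    Only meaningful when the input has two or more sentences;
    with a single sentence there is nothing to reorder."""
    out = list(sentences)
    rng.shuffle(out)
    return out

def delete_words(words, rate=0.15, rng=random):
    """Delete ~`rate` of the words at random positions (word-level deletion).
    Like masking, this needs only a flat sequence of words."""
    k = max(1, int(len(words) * rate))
    drop = set(rng.sample(range(len(words)), k))
    return [w for i, w in enumerate(words) if i not in drop]

# Usage on a toy three-sentence input:
text = "The cat sat on the mat. The dog barked loudly. Then it rained."
words = text.split()
sentences = [s.strip() + "." for s in text.split(".") if s.strip()]

print(" ".join(mask_words(words)))
print(" ".join(permute_sentences(sentences)))
print(" ".join(delete_words(words)))
```

Note that `permute_sentences` degenerates to the identity on a one-sentence input, which is exactly why that method presupposes a multi-sentence text, while the two word-level methods apply to any non-empty word sequence.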
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
BART Model's Corruption Methods for Multi-Sentence Sequences
When pre-training a model on a document, a common strategy is to intentionally alter the input text and task the model with restoring the original. Which of the following alteration techniques is uniquely dependent on the input text containing more than one sentence?
When preparing text data to train a language model, various "corruption" techniques are used to alter the original input, which the model then learns to restore. Some of these techniques operate on the word or token level, while others operate on the sentence level. Match each corruption technique described below with the structural requirement it places on the input text.
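As a concrete aid for that matching exercise, here is a small illustrative mapping in code; the labels are informal descriptions drawn from the prompts above, not identifiers from any real API:

```python
# Informal mapping of each corruption technique to the structure it requires.
# Labels are descriptive only; they are not names from any library.
REQUIREMENTS = {
    "replace 15% of words with a placeholder": "any word sequence",
    "shuffle the order of sentences":          "two or more sentences",
    "delete 15% of words at random positions": "any word sequence",
}

for technique, needed in REQUIREMENTS.items():
    print(f"{technique:42s} -> requires: {needed}")
```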