Learn Before
Evaluating Input Corruption Strategies for Typo Resilience
A research team is pre-training a language model with the primary objective of making it highly robust to common typographical errors found in real-world text. They are debating between two methods of corrupting the input sentences during training:
- Method 1: A portion of the words in each input sentence are completely removed and substituted with a generic placeholder symbol. The model must then predict the original words that belong in those placeholder positions.
- Method 2: A portion of the words in each input sentence are replaced with different, randomly selected words from the vocabulary. The model must then learn to identify these incorrect words and restore the original sentence.
Evaluate which of these two methods is more suitable for achieving the team's specific goal. Justify your choice by explaining the strengths and weaknesses of each method in the context of learning to handle typographical errors.
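The two corruption strategies described above can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual pipeline; the function names, the 15% corruption rate, and the `[MASK]` placeholder string are assumptions chosen for the example.

```python
import random

def corrupt_mask(tokens, rate=0.15, placeholder="[MASK]"):
    # Method 1: remove a fraction of tokens and substitute a generic
    # placeholder symbol; the model must predict the original words.
    out = list(tokens)
    n = max(1, int(rate * len(out)))
    for i in random.sample(range(len(out)), k=n):
        out[i] = placeholder
    return out

def corrupt_random_replace(tokens, vocab, rate=0.15):
    # Method 2: swap a fraction of tokens for randomly selected vocabulary
    # words, so the corrupted input still looks like ordinary (but wrong)
    # text -- closer in form to a real typo than a placeholder is.
    out = list(tokens)
    n = max(1, int(rate * len(out)))
    for i in random.sample(range(len(out)), k=n):
        out[i] = random.choice(vocab)
    return out
```

Note the key structural difference: Method 1 tells the model exactly where the damage is (the placeholder positions are visible), while Method 2 forces the model to first detect which words are wrong before restoring them, which is closer to the typo-correction setting the team cares about.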
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A language model developer is pre-training a model with the specific goal of improving its ability to identify and correct sentences containing incorrect word choices (e.g., distinguishing between 'your' and 'you're'). The model is trained to reconstruct the original, correct sentence from a deliberately damaged version. Which of the following input damage strategies would be most effective for this specific training objective?
Comparing Input Corruption Strategies