Learn Before
A machine learning engineer is training a model to reconstruct a document from a corrupted version. They are considering two different strategies for creating the corrupted input:
- Strategy A: Replace 15% of the words in the document, chosen at random, each with a single [MASK] token.
- Strategy B: Replace three separate, contiguous spans of words (which together make up 15% of the document's total words) with a single [SPAN] token for each span.
Assuming all other factors are equal, which strategy is likely to result in a more computationally efficient training process, and why?
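One way to see the difference is to compare the length of the corrupted input each strategy produces. The sketch below is illustrative only (the function names and the 100-word document are invented for this example, not from the original question): token-level masking keeps the sequence length unchanged, while span masking collapses each multi-word span into one sentinel token, shortening the input the model must process.

```python
import random

def corrupt_token_level(tokens, rate=0.15, mask="[MASK]"):
    # Strategy A: each selected word becomes one [MASK] token,
    # so the corrupted sequence has the same length as the original.
    n_mask = int(len(tokens) * rate)
    idx = set(random.sample(range(len(tokens)), n_mask))
    return [mask if i in idx else t for i, t in enumerate(tokens)]

def corrupt_span_level(tokens, spans, sentinel="[SPAN]"):
    # Strategy B: each contiguous span collapses into a single
    # sentinel token, so the corrupted sequence is shorter.
    out, i = [], 0
    for start, end in sorted(spans):
        out.extend(tokens[i:start])
        out.append(sentinel)
        i = end
    out.extend(tokens[i:])
    return out

tokens = ["w%d" % i for i in range(100)]          # a 100-word document
a = corrupt_token_level(tokens)                    # 15 words -> 15 [MASK] tokens
b = corrupt_span_level(tokens, [(10, 15), (40, 45), (70, 75)])  # 15 words -> 3 [SPAN] tokens
print(len(a), len(b))  # 100 88
```

With both strategies corrupting the same 15% of words, Strategy A's input stays at 100 tokens while Strategy B's shrinks to 88, which is the kind of length difference the question asks you to reason about.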
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Optimizing Training Efficiency
Efficiency vs. Learning Trade-off in Denoising