Learn Before
Representing Masked Spans with Sentinel Tokens
An alternative approach to handling consecutive masked tokens is to treat them as a single span. Following the methodology of Raffel et al. (2020), a unique sentinel token, such as [X], [Y], or [Z], replaces one or more adjacent masked tokens in the corrupted input sequence, effectively creating placeholder slots. The model's training task is then to fill these slots with the correct original tokens using the surrounding context. A significant advantage of consolidating multiple masks into a single placeholder is that the sequences used during training become shorter, leading to more computationally efficient training.
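To make the slot-filling setup concrete, the following Python sketch builds both the corrupted input and its training target for the sentence used in the Learn After example below. The function name, the whitespace tokenization, and the bracketed sentinel strings are illustrative assumptions for readability; T5 itself operates on subword tokens with tokenizer-specific sentinel IDs.

```python
# Minimal sketch of span corruption with sentinel tokens (Raffel et al., 2020).
# Assumptions: whitespace tokenization and "[X]"/"[Y]"/"[Z]" sentinel strings,
# chosen to match this card's notation rather than any real tokenizer.

def corrupt_with_sentinels(tokens, masked_spans):
    """Replace each contiguous masked span with one unique sentinel token.

    tokens       -- list of input tokens
    masked_spans -- sorted, non-overlapping (start, end) pairs, end exclusive
    Returns (corrupted_input, target), both as token lists.
    """
    sentinels = ["[X]", "[Y]", "[Z]"]
    corrupted, target = [], []
    prev_end = 0
    for sentinel, (start, end) in zip(sentinels, masked_spans):
        corrupted.extend(tokens[prev_end:start])  # keep the unmasked context
        corrupted.append(sentinel)                # one placeholder per span
        target.append(sentinel)                   # target pairs each sentinel
        target.extend(tokens[start:end])          # with the tokens it hides
        prev_end = end
    corrupted.extend(tokens[prev_end:])
    return corrupted, target

tokens = "The new algorithm significantly improves performance on the benchmark".split()
inp, tgt = corrupt_with_sentinels(tokens, [(3, 5), (6, 9)])
print(" ".join(inp))  # The new algorithm [X] performance [Y]
print(" ".join(tgt))  # [X] significantly improves [Y] on the benchmark
```

Note how the corrupted input is shorter than the original sentence, and the target lists only the sentinels followed by the tokens each one replaced; this shortening is the efficiency gain described above.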
References
Reference of Foundations of Large Language Models Course
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Example of Denoising Task with Consecutive Token Masking
Representing Masked Spans with Sentinel Tokens
A language model is being trained to predict masked words in a text. Consider two different masking strategies:
Strategy 1: 15% of the words in a sentence are masked individually at random positions. Example:
The quick [MASK] fox jumps [MASK] the lazy dog.
Strategy 2: A contiguous span of several words is masked. Example:
The quick [MASK] [MASK] [MASK] over the lazy dog.
How does using Strategy 2 (masking a contiguous span) primarily alter the learning challenge for the model compared to Strategy 1?
Analyzing a Masked Language Modeling Task
Analyzing Model Performance Discrepancy
Analyzing the Challenge of Consecutive Masking
Learn After
Example of Representing Masked Spans with Sentinel Tokens
A language model is given the input sentence: 'The new algorithm significantly improves performance on the benchmark.' In this instance, the consecutive words 'significantly improves' and 'on the benchmark' are both masked for a training task. How would this input be correctly represented if the modeling approach replaces each contiguous masked span with a single, unique placeholder token?
In a text processing approach where a single, unique placeholder replaces a span of one or more masked words, it is permissible to use the same placeholder (e.g., [X]) to represent two different, non-overlapping masked spans within the same input sentence.
Consider the following input sequence where several consecutive tokens have been masked:
The model was trained [MASK] [MASK] a large dataset and then [MASK] [MASK] [MASK] on a specific task.
Rewrite this sequence by replacing each contiguous span of masked tokens with a single, unique sentinel token, starting with [X]. Your answer should be the complete, rewritten sequence: ____