Learn Before
Model Selection for Long-Sequence Tasks
An engineering team has trained two language models, Model A and Model B, on a dataset whose maximum text length is 512 tokens. The models differ only in how their attention mechanisms encode the relative distance between tokens:
- Model A: Learns a unique, independent bias parameter for every relative distance seen during training (i.e., a separate parameter for distance 1, distance 2, ..., up to distance 511).
- Model B: Learns unique parameters for small, common distances but groups larger distances into 'buckets'. All distances within a single bucket (e.g., all distances from 64 to 95) share a single, common bias parameter.
The team now needs to deploy one of these models for a task that involves processing documents up to 2000 tokens long. Which model should they choose, and why?
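To make the contrast concrete, here is a minimal PyTorch sketch of the two schemes. It is illustrative only: the function names, the bucket width of 32, and the cutoff of 64 exact distances are assumptions for this sketch, not details given in the question.

```python
import torch

MAX_TRAIN_DISTANCE = 511  # longest relative distance seen in training

# Model A: one independent bias per relative distance up to 511.
bias_a = torch.nn.Parameter(torch.zeros(MAX_TRAIN_DISTANCE + 1))

def model_a_bias(distance: int) -> torch.Tensor:
    # Distances beyond 511 were never trained and have no parameter:
    # indexing with e.g. distance=1500 raises an IndexError.
    return bias_a[distance]

# Model B: exact parameters for small distances, fixed-size shared
# buckets for larger ones (bucket width of 32 is an assumption here).
NUM_EXACT = 64
BUCKET_SIZE = 32
NUM_BUCKETS = NUM_EXACT + (MAX_TRAIN_DISTANCE - NUM_EXACT) // BUCKET_SIZE + 1
bias_b = torch.nn.Parameter(torch.zeros(NUM_BUCKETS))

def model_b_bucket(distance: int) -> int:
    if distance < NUM_EXACT:
        return distance  # unique parameter per small distance
    bucket = NUM_EXACT + (distance - NUM_EXACT) // BUCKET_SIZE
    # Clamp so unseen long distances reuse the last trained bucket.
    return min(bucket, NUM_BUCKETS - 1)

def model_b_bias(distance: int) -> torch.Tensor:
    return bias_b[model_b_bucket(distance)]

# A 2000-token document produces distances up to 1999; Model B still
# maps them onto a trained parameter, while Model A has none for them.
print(model_b_bucket(1999))  # -> 77, the final shared bucket
```

The design point the sketch exposes: bucketing plus clamping means every distance, however large, resolves to a parameter that was actually trained, whereas a strictly per-distance table has no defined value beyond its training range.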
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A language model's attention mechanism uses a relative positional bias. During training on text segments never exceeding 512 tokens, it learns a unique bias parameter for each relative distance from 1 to 63; for all distances from 64 to 127 it uses a single shared parameter, for all distances from 128 to 255 another, and so on. The model must now process a document of 2048 tokens. Which statement best analyzes the primary benefit of using shared parameters for larger distances in this scenario?
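A minimal Python sketch of this doubling-bucket scheme (similar in spirit to T5-style relative position buckets; the function name and band arithmetic are assumptions for illustration):

```python
import math

def distance_to_bucket(distance: int) -> int:
    """Map a relative distance to a bias-parameter index under the
    scheme described above: a unique bucket for each distance 1..63,
    then one shared bucket per power-of-two band (64..127, 128..255,
    256..511, ...)."""
    if distance < 64:
        return distance
    band = int(math.log2(distance)) - 6  # 64..127 -> band 0, etc.
    return 64 + band

# Training (max distance 511) exercises buckets 1..66. A 2048-token
# document yields distances up to 2047, which fall into only two new
# bands (512..1023 -> bucket 67, 1024..2047 -> bucket 68); these can
# clamp to or extrapolate from the last trained bucket, rather than
# requiring ~1500 parameters that were never trained at all.
for d in (1, 63, 64, 127, 128, 511, 512, 2047):
    print(d, "->", distance_to_bucket(d))
```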
Rationale for Parameter Sharing in Positional Bias