Learn Before
A large language model processes a long document. Consider two different sparse attention patterns that could be used instead of the standard all-to-all attention:
- Pattern A: Each token can only attend to the 16 tokens immediately preceding and following it.
- Pattern B: In addition to its local neighbors, each token can also attend to a few pre-selected 'global' tokens distributed across the entire sequence (e.g., the first and last tokens).
Which statement best analyzes the primary difference in how these two patterns capture information?
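To make the contrast concrete, here is a minimal sketch of the two attention masks. It assumes NumPy, a symmetric window of 16 tokens on each side, and the first and last tokens as the globals; the names `local_mask` and `local_global_mask` are illustrative, not from any particular library:

```python
import numpy as np

def local_mask(n, window=16):
    """Pattern A: each token attends only to tokens within `window` positions."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def local_global_mask(n, window=16, global_positions=(0, -1)):
    """Pattern B: the local window plus a few pre-selected global tokens."""
    mask = local_mask(n, window)
    g = [p % n for p in global_positions]  # e.g., first and last token
    mask[:, g] = True  # every token may attend to the globals
    mask[g, :] = True  # the globals may attend to every token
    return mask

n = 4096
a, b = local_mask(n), local_global_mask(n)
# Tokens 100 and 4000 are far apart. Pattern A gives them no direct link,
# while Pattern B connects them in two hops through a global token.
print(a[100, 4000])           # False under Pattern A
print(b[100, 0], b[0, 4000])  # True True: a two-hop path via token 0
```

Under Pattern A, information can propagate at most `window` positions per layer, so linking tokens that are n positions apart takes on the order of n/window layers; Pattern B's global tokens cut the longest path to two hops at the cost of a few extra connections per token.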
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An AI development team is building a model to summarize entire books, which involves processing extremely long sequences of text. To manage the computational cost, they decide to replace the standard self-attention mechanism, in which every token attends to every other token, with a sparse variant that restricts which token pairs can attend to each other. Which of the following statements provides the most accurate evaluation of this decision's primary trade-off?
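For a rough sense of the computational side of that trade-off, here is a back-of-the-envelope sketch (assuming a 16-token window on each side and ignoring edge effects and constant factors) comparing attended token pairs per layer:

```python
# Attended pairs per layer: full attention grows quadratically with
# sequence length n, while a fixed-window sparse pattern grows linearly.
def full_pairs(n):
    return n * n

def sparse_pairs(n, window=16):
    return n * (2 * window + 1)  # each token: itself + 16 neighbors per side

for n in (1_000, 10_000, 100_000):
    print(n, full_pairs(n), sparse_pairs(n), full_pairs(n) / sparse_pairs(n))
```

The quadratic-to-linear drop is what makes book-length inputs feasible; the evaluation question is whether the connections that were removed carried information the summary actually needs.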
Optimizing a Legal Document Analysis Model