Learn Before
A large language model processes a long document. Consider two different sparse attention patterns that could be used instead of the standard all-to-all attention:
- Pattern A: Each token can only attend to the 16 tokens immediately preceding and following it.
- Pattern B: In addition to its local neighbors, each token can also attend to a few pre-selected 'global' tokens distributed across the entire sequence (e.g., the first and last tokens).
Which statement best analyzes the primary difference in how these two patterns capture information?
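To make the contrast concrete, here is a minimal sketch of the two attention masks. It assumes NumPy, a symmetric window of 16 tokens on each side, and the first and last tokens as the globals; the names `local_mask` and `local_global_mask` are illustrative, not from any particular library:

```python
import numpy as np

def local_mask(n, window=16):
    """Pattern A: each token attends only to tokens within `window` positions."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def local_global_mask(n, window=16, global_positions=(0, -1)):
    """Pattern B: the local window plus a few pre-selected global tokens."""
    mask = local_mask(n, window)
    g = [p % n for p in global_positions]  # e.g., first and last token
    mask[:, g] = True  # every token may attend to the globals
    mask[g, :] = True  # the globals may attend to every token
    return mask

n = 4096
a, b = local_mask(n), local_global_mask(n)
# Tokens 100 and 4000 are far apart. Pattern A gives them no direct link,
# while Pattern B connects them in two hops through a global token.
print(a[100, 4000])           # False under Pattern A
print(b[100, 0], b[0, 4000])  # True True: a two-hop path via token 0
```

Under Pattern A, information can propagate at most `window` positions per layer, so linking tokens that are n positions apart takes on the order of n/window layers; Pattern B's global tokens cut the longest path to two hops at the cost of a few extra connections per token.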
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An AI development team is building a model to summarize entire books, which involves processing extremely long sequences of text. To manage the computational cost, they decide to replace the standard self-attention mechanism, in which every token attends to every other token, with a sparse variant that restricts which token pairs can attend to each other. Which of the following statements provides the most accurate evaluation of this decision's primary trade-off?
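For a rough sense of the computational side of that trade-off, here is a back-of-the-envelope sketch (assuming a 16-token window on each side and ignoring edge effects and constant factors) comparing attended token pairs per layer:

```python
# Attended pairs per layer: full attention grows quadratically with
# sequence length n, while a fixed-window sparse pattern grows linearly.
def full_pairs(n):
    return n * n

def sparse_pairs(n, window=16):
    return n * (2 * window + 1)  # each token: itself + 16 neighbors per side

for n in (1_000, 10_000, 100_000):
    print(n, full_pairs(n), sparse_pairs(n), full_pairs(n) / sparse_pairs(n))
```

The quadratic-to-linear drop is what makes book-length inputs feasible; the evaluation question is whether the connections that were removed carried information the summary actually needs.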
Optimizing a Legal Document Analysis Model