Learn Before
Multiple Choice

A large language model processes a long document. Consider two different sparse attention patterns that could be used instead of the standard all-to-all attention:

  • Pattern A: Each token can only attend to the 16 tokens immediately preceding and following it.
  • Pattern B: In addition to its local neighbors, each token can also attend to a few pre-selected 'global' tokens that are distributed across the entire sequence (e.g., the first token, the last token).

Which statement best analyzes the primary difference in how these two patterns capture information?
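The trade-off between the two patterns can be made concrete with attention masks. Below is a minimal sketch using NumPy boolean masks; the sequence length, window size, and function names are illustrative, not from the question. The key structural difference it exposes: under Pattern A, information can only travel `w` positions per layer, while Pattern B's global tokens give any two positions a two-hop path (token → global token → token).

```python
import numpy as np

def local_mask(n, w=16):
    # Pattern A: token i attends only to tokens within +/- w positions.
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

def local_global_mask(n, w=16, global_tokens=(0,)):
    # Pattern B: the local window, plus a few global tokens that
    # every token attends to (and that attend to every token).
    mask = local_mask(n, w)
    for g in global_tokens:
        mask[:, g] = True  # every token can attend to the global token
        mask[g, :] = True  # the global token can attend to every token
    return mask

n = 1024
a = local_mask(n)
b = local_global_mask(n, global_tokens=(0, n - 1))

# Both patterns keep attention sparse relative to full n*n attention,
# and Pattern B's mask is a strict superset of Pattern A's.
print(a.sum(), b.sum(), n * n)
```

Note that Pattern B adds only O(n) extra attended positions per global token, so it stays linear in sequence length while collapsing the worst-case information path from roughly `n / w` layers to two hops.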


Updated 2025-10-03


Tags: Ch.5 Inference - Foundations of Large Language Models · Computing Sciences · Analysis in Bloom's Taxonomy