Multiple Choice

A causal transformer model processes a sequence of 1024 tokens. In a standard attention mechanism, each token attends to all previous tokens and itself. Consider a 'sparse' variant where a token at position i (for i > 3) only attends to the following positions: the first token (position 1), its own token (position i), and the two immediately preceding tokens (positions i-1 and i-2). For a token at position 500, how many key-value pairs does it attend to in this sparse model?
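To sanity-check the count, here is a minimal Python sketch (the function name sparse_attention_positions is hypothetical, written only for this card) that enumerates the positions the sparse rule allows for a given token:

```python
def sparse_attention_positions(i: int) -> set[int]:
    """Positions attended to by the token at 1-indexed position i (for i > 3),
    under the sparse rule above: the first token, the two immediately
    preceding tokens, and the token itself."""
    assert i > 3, "the rule as stated applies only for i > 3"
    return {1, i - 2, i - 1, i}

# For position 500 the set is {1, 498, 499, 500}, i.e. 4 key-value pairs.
print(sorted(sparse_attention_positions(500)))  # [1, 498, 499, 500]
print(len(sparse_attention_positions(500)))     # 4
```

The set for position 500 is {1, 498, 499, 500}, so the token attends to 4 key-value pairs.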

