Learn Before
Sparse Attention with a Fixed Key-Value Subset
This form of attention mechanism restricts the query vector at a given position i, denoted q_i, to interact with a predefined, sparse subset of key-value pairs. Instead of attending to the entire history of keys and values (K_≤i, V_≤i), attention is computed only over a specific subset, such as the keys {k_1, k_i} and the values {v_1, v_i}. The operation is written as Att_qkv(q_i, {k_1, k_i}, {v_1, v_i}).
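A minimal sketch of this fixed-subset attention, assuming scaled dot-product attention over NumPy arrays; the head dimension d, the random vectors, and the function name att_qkv are illustrative assumptions, not taken from the card:

    import numpy as np

    def att_qkv(q_i, keys, values):
        # Scaled dot-product attention of a single query over a small,
        # fixed set of key-value pairs (here: {k_1, k_i} and {v_1, v_i}).
        d_k = q_i.shape[-1]
        scores = keys @ q_i / np.sqrt(d_k)   # one score per selected key
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()             # softmax over the subset only
        return weights @ values              # weighted sum of the selected values

    # Hypothetical setup: query at position i plus the two selected pairs.
    d = 4
    q_i = np.random.randn(d)
    k_1, k_i = np.random.randn(d), np.random.randn(d)
    v_1, v_i = np.random.randn(d), np.random.randn(d)

    out = att_qkv(q_i, np.stack([k_1, k_i]), np.stack([v_1, v_i]))

Because the subset has a fixed size, the amount of work here does not depend on how far into the sequence position i is.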
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
A language model generates text token by token. At each step i, an attention operation computes an output using a query vector and a memory component. In a standard causal implementation, this memory component is defined as the complete set of key and value vectors from all previous steps (1 to i). Based on this definition, what is the direct relationship between the size of this memory component and the length of the generated sequence i? (A short sketch of this memory component follows this list.)
Sparse Attention with a Fixed Key-Value Subset
Evaluating Memory Models in Attention Mechanisms
Evaluating an Attention Mechanism for a Real-Time Application
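A minimal sketch of the memory component described in the first related question above, assuming a per-step key-value cache of NumPy vectors; the head dimension d and the loop length are illustrative assumptions, not taken from the card:

    import numpy as np

    d = 8                          # hypothetical head dimension
    cached_keys, cached_values = [], []

    for i in range(1, 6):          # generate a few tokens step by step
        k_i, v_i = np.random.randn(d), np.random.randn(d)
        cached_keys.append(k_i)    # the memory holds keys/values from steps 1..i
        cached_values.append(v_i)
        print(i, len(cached_keys)) # cache size equals i at every step

The printed cache size tracks i exactly, which is the relationship the question asks about.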
Learn After
An autoregressive model generates a sequence token by token. In a standard implementation, the query vector at position i (q_i) computes attention over the key-value pairs from all preceding positions, from 1 to i. Consider a modified implementation where the query q_i is restricted to attend only to the key-value pairs from the very first position (k_1, v_1) and its own current position (k_i, v_i). How does the computational cost of calculating the attention output for a single query q_i scale as the sequence length i grows very large (e.g., from 100 to 10,000)? (A cost-counting sketch follows this list.)
Trade-offs in Attention Mechanisms
Optimizing Attention for Long-Sequence Processing
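A back-of-the-envelope sketch contrasting the standard causal case with the fixed-subset case from the question above, assuming the work for one query is proportional to the number of attended keys times the head dimension; the dimension and the simple cost model are illustrative assumptions, not taken from the card:

    def attention_cost(num_keys, d):
        # Each attended key contributes one d-dimensional dot product,
        # so the work for a single query grows with num_keys * d.
        return num_keys * d

    d = 64                                  # hypothetical head dimension
    for i in (100, 10_000):
        full_cost = attention_cost(i, d)    # standard causal: attend to positions 1..i
        fixed_cost = attention_cost(2, d)   # fixed subset: attend only to {k_1, k_i}
        print(i, full_cost, fixed_cost)

Under this cost model, the standard causal cost grows with i while the fixed-subset cost stays constant as i goes from 100 to 10,000.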