Learn Before
An autoregressive model generates a sequence token by token. In a standard implementation, the query vector at position i (q_i) computes attention over the key-value pairs from all preceding positions, from 1 to i. Consider a modified implementation where the query q_i is restricted to attend only to the key-value pairs from the very first position (k_1, v_1) and its own current position (k_i, v_i). How does the computational cost of calculating the attention output for a single query q_i scale as the sequence length i grows very large (e.g., from 100 to 10,000)?
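A minimal NumPy sketch of the two implementations (hypothetical helper names, single query vector, no batching or scaling factor): the standard version computes i dot products and a weighted sum over i values, so its cost grows linearly with i, while the modified version touches only the two key-value pairs (k_1, v_1) and (k_i, v_i), so its cost is constant in i.

```python
import numpy as np

def standard_attention(q_i, K, V):
    # Standard causal attention: q_i attends to all i preceding positions.
    # K, V have shape (i, d), so both the dot products and the weighted
    # sum scale as O(i * d) -- linear in the sequence length i.
    scores = K @ q_i
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def modified_attention(q_i, K, V):
    # Modified attention: q_i attends only to the first position and its
    # own position. Only two rows are used, so the work is O(d) --
    # constant with respect to the sequence length i.
    K2 = K[[0, -1]]          # (k_1, k_i)
    V2 = V[[0, -1]]          # (v_1, v_i)
    scores = K2 @ q_i
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V2
```

For i = 2 the two functions coincide (the only preceding positions are 1 and i); for large i, only the modified version avoids work proportional to i.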
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Trade-offs in Attention Mechanisms
Optimizing Attention for Long-Sequence Processing