Linear Attention
Linear attention is an efficient alternative designed to overcome the memory-intensive limitation of explicitly retaining the entire Key-Value (KV) cache (the keys $K$ and values $V$) during the inference of very long sequences. It modifies standard attention by employing a kernel function $\phi(\cdot)$ to project each query vector $q_i$ and key vector $k_j$ into new representations $\phi(q_i)$ and $\phi(k_j)$. By applying this transformation and removing the standard Softmax function, the order of matrix multiplications can be rearranged. This structural change avoids computing the large $n \times n$ attention matrix and eliminates the requirement to explicitly store the KV cache, making the process highly memory-efficient.
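A minimal sketch of the idea in NumPy, assuming the commonly used feature map $\phi(x) = \mathrm{elu}(x) + 1$ and a non-causal formulation; the function names, toy dimensions, and kernel choice here are illustrative assumptions, not quoted from the course material:

```python
# Minimal (non-causal) linear attention sketch, assuming phi(x) = elu(x) + 1.
# Shapes and toy dimensions are illustrative only.
import numpy as np

def phi(x):
    # Feature map applied elementwise to queries and keys: elu(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Q, K: (n, d_k); V: (n, d_v). Returns (n, d_v) outputs."""
    Qp, Kp = phi(Q), phi(K)            # project queries and keys
    KV = Kp.T @ V                      # (d_k, d_v): summary of all keys/values
    Z = Qp @ Kp.sum(axis=0)            # (n,): per-query normalizer
    return (Qp @ KV) / Z[:, None]      # never forms the (n, n) attention matrix

n, d_k, d_v = 6, 4, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k)), rng.normal(size=(n, d_v))
print(linear_attention(Q, K, V).shape)  # (6, 4)
```

Because $\phi(K)^{\top}V$ is a small $d_k \times d_v$ summary that can be accumulated token by token, nothing proportional to the full sequence-by-sequence attention matrix or the per-token KV cache needs to be materialized.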
References
A Survey of Transformers (Lin et al., 2021)
Foundations of Large Language Models Course
Tags
Data Science
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Sparse Attention
Query Prototyping and Memory Compression
Low Rank Self-Attention
Attention with Prior
Improved Multi-Head Attention Mechanism
Linear Attention
A research team is working to reduce the computational cost of the attention mechanism for processing extremely long documents. Their proposed solution involves modifying the attention calculation so that each query token only computes attention scores with a small, fixed subset of key tokens (e.g., neighboring tokens and a few globally important tokens) instead of all tokens in the sequence. Which category of attention improvement best describes this approach?
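A tiny mask-construction sketch of the pattern described in this scenario, where each query attends only to a local window plus a few designated global tokens; the window size and global indices are made-up values for illustration:

```python
# Build a boolean attention mask: each query sees neighboring tokens plus
# a few global tokens. Parameters below are hypothetical examples.
import numpy as np

def sparse_mask(n, window=2, global_tokens=(0,)):
    """Return an (n, n) boolean mask; True means query i may attend to key j."""
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window  # local neighborhood
    mask[:, list(global_tokens)] = True                   # everyone attends to globals
    mask[list(global_tokens), :] = True                   # globals attend to everyone
    return mask

print(sparse_mask(8).astype(int))
```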
Match each attention improvement strategy with its core operational principle.
Optimizing Transformer Attention for Long Sequences
Evaluating Attention Optimization Strategies for Specific Applications
Architectural Adaptation of LLMs for Long Sequences
Linear Attention
Classification of Memory Models in LLMs
Memory Models in LLMs as Context Encoders
PagedAttention for KV Cache Memory Optimization
Strategies for Mitigating KV Cache Memory Usage
A machine learning engineer is deploying a large language model and finds that the system frequently runs out of memory during inference. They are investigating two specific high-load scenarios, both of which involve processing a total of 16,000 tokens:
- Scenario X: Processing a batch of 32 user requests simultaneously, where each request has a context length of 500 tokens.
- Scenario Y: Processing a single user request that involves summarizing a very long document with a context length of 16,000 tokens.
Based on how attention states (keys and values) are managed during inference, which statement best analyzes the memory consumption issue?
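A back-of-the-envelope sizing sketch for the two scenarios above; the model dimensions (32 layers, 4096 hidden size, fp16) are hypothetical and chosen only to make the arithmetic concrete:

```python
# Rough KV-cache size: 2 tensors (K and V) per layer, one hidden-size
# vector per token, 2 bytes per value in fp16. Dimensions are assumptions.
layers, hidden, bytes_fp16 = 32, 4096, 2

def kv_cache_bytes(total_tokens):
    return 2 * layers * hidden * bytes_fp16 * total_tokens

for name, batch, ctx in [("Scenario X", 32, 500), ("Scenario Y", 1, 16_000)]:
    gib = kv_cache_bytes(batch * ctx) / 2**30
    print(f"{name}: {batch} x {ctx} tokens -> ~{gib:.1f} GiB of KV cache")
```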
Architectural Shift in LLMs due to Long-Sequence Limitations
Diagnosing Inference Failures with Long Documents
Analyzing Memory Constraints in Different LLM Applications
Learn After
Linear Causal Attention Formula
Normalization Transformation in Linear Attention
A language model is being optimized to process very long sequences of text while minimizing memory consumption during inference. The standard attention mechanism is replaced with an alternative approach that applies a kernel function to the query and key vectors and omits the Softmax operation. This change allows the order of matrix multiplications to be rearranged. Which of the following best analyzes the primary benefit of this modification?
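A short worked equation for the rearrangement this question refers to, with the normalizing term omitted for brevity (the notation is assumed here, not quoted from the course text):

```latex
% With Q, K \in R^{n x d_k}, V \in R^{n x d_v}, and phi applied row-wise,
% dropping Softmax lets associativity regroup the product:
\mathrm{Attn}(Q, K, V)
  \;=\; \underbrace{\bigl(\phi(Q)\,\phi(K)^{\top}\bigr)}_{n \times n}\, V
  \;=\; \phi(Q)\,\underbrace{\bigl(\phi(K)^{\top} V\bigr)}_{d_k \times d_v}
```

The left grouping costs on the order of $n^2 d$ and materializes an $n \times n$ matrix, while the right grouping costs on the order of $n d^2$ and keeps only a small $d_k \times d_v$ summary, which is the memory benefit the question asks about.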
Optimizing a Long-Context Language Model
A language model is being modified to use a memory-efficient attention mechanism for processing long documents. This involves altering the standard attention calculation. Arrange the following steps in the logical order they occur in this modified process.
You’re leading an LLM platform team that must supp...
You’re debugging an LLM inference service that mus...
Your team is deploying a chat-based LLM that must ...
Selecting an Attention Design for Long-Context, Low-Latency Inference
Diagnosing and Redesigning Attention for a Long-Context, Cost-Constrained LLM Service
Choosing an Attention Stack for a Regulated, Long-Document Review Assistant
You’re reviewing a design doc for a Transformer at...
Attention Redesign for a Long-Context Customer-Support Copilot Under GPU Memory Pressure
Attention Architecture Choice for On-Device Meeting Summarization with 60k Context
Attention Redesign for a Multi-Tenant LLM with Long Context and Strict KV-Cache Budgets