Concept

Linear Attention

Linear attention is an efficient alternative designed to overcome the memory-intensive limitations of explicitly retaining the entire Key-Value (KV) cache ($\mathbf{K}_{\le i}$ and $\mathbf{V}_{\le i}$) during the inference of very long sequences. It modifies standard attention by employing a kernel function $\phi(\cdot)$ to project each query vector $\mathbf{q}_i$ and key vector $\mathbf{k}_i$ into new representations: $\mathbf{q}'_i = \phi(\mathbf{q}_i)$ and $\mathbf{k}'_i = \phi(\mathbf{k}_i)$. By applying this transformation and removing the standard Softmax function, the order of matrix multiplications can be rearranged. This structural change avoids the need to compute the large attention matrix and eliminates the requirement to explicitly store the KV cache, making the process highly memory-efficient.
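The rearrangement described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the feature map $\phi$ is chosen here as elu(x) + 1 (one common choice; the concept above does not fix a particular kernel), and the causal variant maintains two running summaries, $\mathbf{S}_i = \sum_{j \le i} \phi(\mathbf{k}_j)\mathbf{v}_j^\top$ and $\mathbf{z}_i = \sum_{j \le i} \phi(\mathbf{k}_j)$, instead of storing the full KV cache:

```python
import numpy as np

def phi(x):
    # Feature map: elu(x) + 1 keeps features positive.
    # (One common choice; the kernel is an assumption here.)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Causal linear attention with constant-size state.

    Instead of keeping all past keys/values, maintain running sums
    S = sum_j phi(k_j) v_j^T  (d x d_v) and z = sum_j phi(k_j)  (d,).
    Memory per step is O(d * d_v), independent of sequence length.
    """
    n, d = Q.shape
    dv = V.shape[1]
    S = np.zeros((d, dv))
    z = np.zeros(d)
    out = np.empty((n, dv))
    for i in range(n):
        q, k, v = phi(Q[i]), phi(K[i]), V[i]
        S += np.outer(k, v)          # accumulate phi(k_i) v_i^T
        z += k                       # accumulate phi(k_i)
        out[i] = (q @ S) / (q @ z)   # normalized output for step i
    return out

def kernel_attention_reference(Q, K, V):
    """Quadratic-memory reference: explicitly builds the n x n
    attention matrix with the same kernel, for comparison."""
    A = phi(Q) @ phi(K).T                 # kernelized similarity scores
    A = np.tril(A)                        # causal mask
    A = A / A.sum(axis=1, keepdims=True)  # row-normalize (Softmax removed)
    return A @ V
```

Because the Softmax is replaced by this simple row normalization, both functions compute the same outputs; the linear version just never materializes the $n \times n$ matrix or the growing KV cache.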

Updated 2026-04-22

Tags

Data Science

Ch.2 Generative Models - Foundations of Large Language Models

Computing Sciences
