Comparing Memory Usage of Attention Mechanisms
A large language model is processing a very long document (sequence length m). Compare the growth of the Key-Value (KV) cache's memory footprint as m increases for two scenarios: (1) the model uses standard attention, and (2) the model uses sliding window attention with a fixed window size m_w. Explain which approach is more scalable for processing extremely long sequences and why.
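The contrast the question asks for can be made concrete with a small sketch. A minimal model, assuming one cached (key, value) pair per processed token per layer: standard attention keeps every token's keys and values, so the cache grows linearly in the sequence length m, while sliding window attention retains at most m_w recent tokens, so the cache size is bounded by a constant regardless of m. The window size 4096 below is purely illustrative, not taken from any particular model.

```python
def kv_cache_entries(seq_len, window=None):
    """Number of cached (key, value) pairs per layer per head.

    Standard attention (window=None) caches every processed token,
    so memory grows linearly: O(m).
    Sliding window attention keeps only the most recent `window`
    tokens, so memory is bounded: O(m_w), constant once seq_len
    exceeds the window.
    """
    if window is None:
        return seq_len
    return min(seq_len, window)

# Illustrative comparison as the document grows (window=4096 is a
# hypothetical choice for the sketch, not a specific model's setting):
for m in (1_000, 10_000, 100_000):
    std = kv_cache_entries(m)
    swa = kv_cache_entries(m, window=4096)
    print(f"m={m:>7}: standard={std:>7} entries, sliding window={swa} entries")
```

Once m exceeds the window, the sliding-window cache stops growing entirely, which is why it scales to extremely long sequences where the linearly growing standard cache would eventually exhaust memory.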
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A large language model is configured to process text by only storing and considering the keys and values of the most recent 512 tokens when calculating attention for each new token. As the model processes a document that grows from 1,000 tokens to 100,000 tokens in length, how will the memory required for this key-value storage be affected?
Chatbot Memory Optimization
Comparing Memory Usage of Attention Mechanisms