Space Complexity of Sliding Window Attention
In sliding window attention, the space complexity of the Key-Value (KV) cache is reduced by storing keys and values for only a fixed-size window of the most recent tokens (n_c), rather than for the entire sequence. This approach results in a constant memory footprint with respect to the sequence length, given by 2 · L · n_c · h · d elements (the factor of 2 covers keys and values), where L is the number of layers, n_c is the window size, h is the number of attention heads, and d is the head dimension.
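A minimal sketch of this calculation, assuming a hypothetical fp16 model (32 layers, 32 heads, head dimension 128, 4096-token window); none of these dimensions come from the source:

```python
def kv_cache_bytes(num_layers: int, window_size: int, num_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    # 2 accounts for storing both keys and values; bytes_per_elem = 2 for fp16.
    return 2 * num_layers * window_size * num_heads * head_dim * bytes_per_elem

WINDOW = 4096  # hypothetical window size n_c

for seq_len in (1_000, 10_000, 100_000):
    # With a sliding window, the cached length is capped at n_c,
    # so memory stops growing once seq_len exceeds the window.
    cached = min(seq_len, WINDOW)
    mib = kv_cache_bytes(num_layers=32, window_size=cached,
                         num_heads=32, head_dim=128) / 2**20
    print(f"seq_len={seq_len:>7,}: KV cache ~ {mib:,.0f} MiB")
```

Running this shows the cache growing only until the sequence reaches the window size, then holding constant (about 2 GiB for these dimensions), which is the constant-memory property described above.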
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Formula for Fixed-Size Window Memory
Window-based Cache as an Example of Fixed-Size Memory
Space Complexity of Sliding Window Attention
Window Size (n_c)
A language model is designed to process extremely long sequences of text, and its developers are concerned about computational resources. They are considering two approaches for the attention mechanism: one that considers all previous tokens in the sequence, and another that only considers a fixed-size window of the 100 most recent tokens. What is the fundamental trade-off between these two approaches?
Applying Sliding Window Attention
In an attention mechanism that uses a fixed-size sliding window, the amount of memory required to store the keys and values for the attention calculation increases as the input sequence gets longer.
Your team is documenting the memory subsystem of a...
You are reviewing two candidate memory designs for...
You’re deploying an internal LLM assistant that mu...
You’re designing an internal LLM feature that moni...
Post-Incident Review: Memory Design for Long-Running Customer Support Chats
Diagnosing Long-Range Failures in a Segment-Processed LLM with Dual Memory
Choosing a Memory Architecture for Long-Context Enterprise Summarization
Postmortem: Long-Document QA Failures Under Fixed-Window vs Compressive Memory
Selecting and Justifying a Long-Context Memory Design for a Regulated Audit Assistant
Incident Triage: Long-Running Agent Workflow with Windowed vs Compressive Memory
Optimizing Memory for Long-Document Processing
An auto-regressive language model is generating a long text, one token at a time. To manage memory, it employs a key-value caching strategy where it only stores the keys and values for the most recent 2048 tokens. How will the memory allocated for this cache change as the model generates the 5000th token and continues beyond it?
Comparing KV Cache Memory Growth
Learn After
A large language model is configured to process text by only storing and considering the keys and values of the most recent 512 tokens when calculating attention for each new token. As the model processes a document that grows from 1,000 tokens to 100,000 tokens in length, how will the memory required for this key-value storage be affected?
Chatbot Memory Optimization
Comparing Memory Usage of Attention Mechanisms