Multiple Choice

A large language model is configured so that, when computing attention for each new token, it stores and attends over only the keys and values of the most recent 512 tokens. As the model processes a document that grows from 1,000 to 100,000 tokens, how will the memory required for this key-value storage change?
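The effect of such a sliding window on key-value cache size can be sketched with a short calculation. The model dimensions below (layers, heads, head size, fp16 storage) are hypothetical, chosen only for illustration; the point is that cached-token count is capped at the window size, so memory stops growing once the document exceeds 512 tokens.

```python
# Hypothetical model dimensions, for illustration only (not given in the question).
NUM_LAYERS = 32
NUM_HEADS = 32
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # fp16 storage
WINDOW = 512         # sliding-window size from the question

def kv_cache_bytes(seq_len: int, window: int = WINDOW) -> int:
    """Bytes needed for the KV cache with a sliding attention window."""
    # Only the most recent min(seq_len, window) tokens are kept.
    cached_tokens = min(seq_len, window)
    # Two tensors (keys and values) per layer, each [heads, cached_tokens, head_dim].
    return 2 * NUM_LAYERS * NUM_HEADS * cached_tokens * HEAD_DIM * BYTES_PER_VALUE

# The cache is identical at 1,000 and 100,000 tokens: both are capped at 512.
print(kv_cache_bytes(1_000) == kv_cache_bytes(100_000))  # → True
```

Below 512 tokens the cache still grows linearly with sequence length; only past the window does it plateau, which is why the answer depends on the window rather than the document length.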


Updated 2025-10-02


Tags: Ch.2 Generative Models - Foundations of Large Language Models; Foundations of Large Language Models Course; Computing Sciences; Analysis in Bloom's Taxonomy
