Concept

Chunked and Windowed Attention

Chunked and windowed attention are techniques designed to mitigate the memory consumption of the KV cache. They limit the scope of self-attention to a bounded subset of tokens: a sliding 'window' of the most recent tokens, or a fixed-size 'chunk' of the sequence. Because keys and values outside that scope can be discarded, the KV cache that must be stored stays bounded. This memory saving comes with a trade-off: context from older tokens is lost, or must be recomputed if it is needed again.
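
The sketch below illustrates the windowed variant during incremental decoding: only the most recent `window` key/value pairs are retained, so the cache stays bounded while older context is dropped. The function name, window size, and shapes are illustrative assumptions, not part of the original text.

```python
# Minimal sketch of sliding-window (windowed) attention during decoding.
# Assumption: single head, NumPy, one query per step; names are illustrative.
import numpy as np

def windowed_decode_step(q, k_cache, v_cache, k_new, v_new, window):
    """Append the new key/value, evict entries older than `window`,
    and attend the single query `q` over the retained window."""
    # Grow the cache with the newest token's key/value.
    k_cache = np.concatenate([k_cache, k_new[None, :]], axis=0)
    v_cache = np.concatenate([v_cache, v_new[None, :]], axis=0)
    # Keep only the most recent `window` entries: this is the memory saving,
    # and also where context older than the window is lost.
    k_cache = k_cache[-window:]
    v_cache = v_cache[-window:]
    # Standard scaled dot-product attention over the retained window.
    d = q.shape[-1]
    scores = k_cache @ q / np.sqrt(d)     # shape: (<= window,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ v_cache               # shape: (d,)
    return out, k_cache, v_cache

# Usage: decode 10 tokens with a window of 4; the cache never exceeds 4 entries.
rng = np.random.default_rng(0)
d, window = 8, 4
k_cache, v_cache = np.empty((0, d)), np.empty((0, d))
for t in range(10):
    q, k_new, v_new = rng.normal(size=(3, d))
    out, k_cache, v_cache = windowed_decode_step(q, k_cache, v_cache, k_new, v_new, window)
print(k_cache.shape)  # (4, 8): KV cache is bounded by the window size
```

Chunked attention follows the same idea but splits the sequence into fixed blocks rather than sliding a window token by token; the memory bound is the chunk size instead of the window size.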

Updated 2026-05-06

Tags: Ch.5 Inference - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences