Comparison of Memory Allocation in Standard vs. Paged Attention

Memory allocation for the Key-Value (KV) cache differs sharply between standard self-attention and PagedAttention. Standard self-attention implementations store each sequence's KV cache in a single contiguous block of memory to allow efficient access, so if the available memory is fragmented into smaller, scattered pieces, those pieces cannot be used. By contrast, PagedAttention divides the KV cache into small, fixed-size blocks that need not be contiguous: a per-sequence block table maps logical positions to whatever physical blocks happen to be free. This lets the system place the cache in otherwise-unusable fragmented regions, removing the contiguity requirement and achieving significantly better memory utilization.
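The block-table scheme described above can be sketched as a small allocator. This is a minimal illustration under stated assumptions, not vLLM's actual implementation: the class name `BlockAllocator`, the free-list strategy, and the parameter names (`block_size`, `num_physical_blocks`) are all hypothetical.

```python
class BlockAllocator:
    """Hands out fixed-size KV-cache blocks from a shared physical pool.

    Physical blocks need not be contiguous: each sequence keeps a block
    table mapping its logical block index to whichever physical block
    happened to be free at allocation time.
    """

    def __init__(self, num_physical_blocks: int, block_size: int):
        self.block_size = block_size                   # tokens stored per block
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[int, list[int]] = {}   # seq_id -> physical block ids

    def append_token(self, seq_id: int, num_tokens_so_far: int) -> int:
        """Return the physical block that will hold the next token's KV
        entries, allocating a new block only when the current one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens_so_far % self.block_size == 0:   # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())       # any free block works: no contiguity needed
        return table[-1]

    def free_sequence(self, seq_id: int) -> None:
        """Return all of a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))


alloc = BlockAllocator(num_physical_blocks=8, block_size=4)
for t in range(6):                  # a 6-token sequence needs ceil(6/4) = 2 blocks
    alloc.append_token(seq_id=0, num_tokens_so_far=t)
print(alloc.block_tables[0])        # two (possibly non-adjacent) physical block ids
alloc.free_sequence(0)              # blocks go back to the pool immediately
```

Because the block table indirects every lookup, a sequence can grow one block at a time and release its memory the moment it finishes, which is the source of the utilization gains over a contiguous, worst-case-sized allocation.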

Updated 2026-05-06

Tags

Ch.5 Inference - Foundations of Large Language Models

Computing Sciences