Concept

Memory Allocation for KV Caching in Standard Self-Attention

In a standard self-attention implementation, the Key-Value (KV) cache for each sequence is stored as a single, contiguous block of memory. Contiguity makes data access efficient, but it forces the system to reserve one large, unbroken region per sequence. As sequences of varying lengths are dynamically allocated and deallocated, this requirement leads to memory fragmentation: small, unusable gaps accumulate between live blocks and complicate future allocations.
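The sketch below makes the fragmentation problem concrete. It is a toy simulation, not code from any real inference framework: the `ContiguousKVCache` class, its first-fit policy, and the 16-slot capacity are all illustrative assumptions. Each sequence's KV entries must occupy one contiguous run of token slots in a flat buffer, so freeing a finished sequence leaves a hole that only a shorter sequence can reuse.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    """A contiguous run of token slots reserved for one sequence's KV cache."""
    offset: int  # first slot in the flat buffer
    size: int    # number of token slots


class ContiguousKVCache:
    """Toy first-fit allocator over a flat buffer of KV token slots.

    Mirrors the standard layout described above: each sequence gets one
    contiguous block, and blocks of varying sizes come and go dynamically.
    (Illustrative assumption, not the API of any particular framework.)
    """

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.blocks: list[Block] = []

    def allocate(self, size: int) -> Optional[Block]:
        # Walk the allocated blocks in address order, looking for the
        # first gap large enough to hold `size` contiguous slots.
        cursor = 0
        for b in sorted(self.blocks, key=lambda blk: blk.offset):
            if b.offset - cursor >= size:
                break  # the gap before this block is big enough
            cursor = b.offset + b.size
        if cursor + size > self.capacity:
            return None  # no single gap fits, even if total free space would
        block = Block(cursor, size)
        self.blocks.append(block)
        return block

    def free(self, block: Block) -> None:
        self.blocks.remove(block)  # the vacated slots become a gap

    def free_slots(self) -> int:
        return self.capacity - sum(b.size for b in self.blocks)


# Three sequences fill a 16-slot buffer almost exactly.
cache = ContiguousKVCache(capacity=16)
seq_a = cache.allocate(5)  # slots 0..4
seq_b = cache.allocate(5)  # slots 5..9
seq_c = cache.allocate(5)  # slots 10..14

# Sequences A and C finish and release their blocks.
cache.free(seq_a)
cache.free(seq_c)

# 11 slots are now free, but they are split into a 5-slot and a 6-slot
# gap around sequence B, so an 8-slot request cannot be served.
print(cache.free_slots())         # 11
print(cache.allocate(8) is None)  # True: external fragmentation
```

A shorter sequence (six slots or fewer here) could still be admitted, but the 8-slot request fails despite 11 free slots, which is exactly the fragmentation described above.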
