Learn Before
  • Continuous Batching for LLM Inference

  • Memory Allocation for KV Caching in Standard Self-Attention

Memory Fragmentation in LLM Inference

As a language model generates text, the serving system continuously allocates and frees memory, most notably for the KV cache. This dynamic memory usage can lead to fragmentation, where the available memory is split into many small, non-contiguous blocks. The diagram visualizes this with interspersed used and free memory blocks. Fragmentation poses a significant challenge: it can prevent the allocation of the large, contiguous memory chunks needed for new or growing sequences, even when a substantial amount of memory is free in total, thereby reducing system efficiency.

[Diagram: a memory pool with interspersed used and free blocks, illustrating fragmentation]
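The effect described above can be sketched with a toy first-fit contiguous allocator (a minimal illustration, not the allocator of any real serving system; the class and sizes are hypothetical). Variable-length KV caches are allocated as contiguous runs; when some sequences finish and free their runs, the remaining free space is large in total but split into small gaps, so a longer new sequence cannot be placed.

```python
class ContiguousPool:
    """Toy first-fit allocator over a fixed pool of token slots."""

    def __init__(self, size):
        self.size = size
        self.allocs = {}   # alloc id -> (start, length)
        self.next_id = 0

    def _free_runs(self):
        """Yield (start, length) for each maximal free run, in order."""
        cursor = 0
        for start, length in sorted(self.allocs.values()):
            if start > cursor:
                yield (cursor, start - cursor)
            cursor = start + length
        if cursor < self.size:
            yield (cursor, self.size - cursor)

    def alloc(self, length):
        """First-fit: return an id, or None if no contiguous run fits."""
        for start, run in self._free_runs():
            if run >= length:
                self.allocs[self.next_id] = (start, length)
                self.next_id += 1
                return self.next_id - 1
        return None

    def free(self, alloc_id):
        del self.allocs[alloc_id]

    def total_free(self):
        return self.size - sum(l for _, l in self.allocs.values())


pool = ContiguousPool(100)
ids = [pool.alloc(20) for _ in range(5)]   # five 20-slot KV caches fill the pool
for i in (0, 2, 4):                        # three sequences finish and free theirs
    pool.free(ids[i])
print(pool.total_free())                   # 60 slots free in total...
print(pool.alloc(40))                      # ...but no 40-slot contiguous run: None
```

Here 60% of the pool is free, yet the free space exists only as three 20-slot gaps, so the 40-slot request fails — the same "allocation fails despite free memory" symptom described in the scenarios below.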


Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Computing Sciences

Foundations of Large Language Models Course

Related
  • Iteration in Continuous Batching

  • General Process of Continuous Batching

  • Example of Interleaving Prefilling and Decoding in Continuous Batching

  • Overhead of Dynamic Batch Reorganization in Continuous Batching

  • Memory Fragmentation in LLM Inference

  • Prefilling-Prioritized Strategy in Continuous Batching

  • Simple Iteration-level Scheduling

  • Priority-Based Scheduling in LLM Inference

  • Custom Priority Policies in LLM Scheduling

  • Disaggregation of Prefilling and Decoding using Pipelined Engines

  • Comparison of Continuous (Prefilling-Prioritized) vs. Standard (Decoding-Prioritized) Batching

  • LLM Inference Scheduling Strategy

  • An LLM inference server is processing a batch of three long-running requests. In the middle of this process, after several computational steps have already been completed for the initial batch, a new, short request arrives. How would a system implementing continuous batching most likely handle this new request in the next computational step?

  • An LLM inference system is designed to maximize hardware utilization. Which of the following operational descriptions best illustrates the core principle of continuous batching, distinguishing it from a static batching approach?

  • Comparison of Memory Allocation in Standard vs. Paged Attention

  • Diagnosing Inference Server Failures

  • An inference server running a large language model processes thousands of text generation requests, each with a different sequence length. The server allocates memory for the key and value vectors of each sequence as a single, contiguous block. After some time, the server begins to fail when trying to allocate memory for new requests, despite system monitoring tools showing that a significant total amount of memory is still free. Which statement best analyzes the most likely reason for these allocation failures?

  • Drawbacks of Contiguous Memory Allocation for KV Caching

Learn After
  • Example of Padded Sequences in Fragmented Memory

  • PagedAttention for KV Cache Memory Optimization

  • An LLM serving system is processing numerous concurrent requests of varying lengths. As requests are completed, their associated memory is freed. After running for some time, the system's overall throughput decreases, and it frequently fails to start processing new, long sequences, even though monitoring tools show that a significant percentage of total memory is free. Based on this scenario, what is the most accurate evaluation of the underlying problem?

  • LLM Memory Allocation Failure Analysis

  • The Paradox of Free Memory in LLM Serving

  • You run an internal LLM inference service for empl...

  • You’re on-call for an internal LLM chat service. M...

  • You operate a GPU-backed LLM service that uses con...

  • Your company’s internal LLM service handles many c...

  • Evaluating a serving design that combines prefix caching with paged KV memory under mixed prompt lengths

  • Choosing a KV-cache strategy for shared-prefix traffic under GPU memory pressure

  • Diagnosing and Redesigning KV-Cache Memory Behavior in a Multi-Tenant LLM Serving Stack

  • Stabilizing latency and GPU memory in a chat-completions service with shared system prompts

  • Root-cause and mitigation plan for OOMs and latency spikes during shared-prefix, long-generation traffic

  • Post-incident analysis: KV-cache growth, fragmentation, and shared-prefix reuse in a streaming LLM service