Concept

PagedAttention for KV Cache Memory Optimization

Introduced in the vLLM system [Kwon et al., 2023], PagedAttention, also known as paged KV caching, is a memory-optimization strategy for LLM inference. Inspired by virtual-memory paging in operating systems, it addresses the memory fragmentation that arises when contiguous KV-cache buffers are pre-allocated for variable-length sequences under dynamic batching. The core idea is to partition each sequence's KV cache into small, fixed-size blocks, or 'pages', that need not be contiguous in memory; a per-sequence block table maps logical blocks to physical ones, so blocks can be allocated on demand as tokens are generated, which greatly reduces wasted memory.
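
To make the block-table idea concrete, below is a minimal, illustrative Python sketch of the bookkeeping that paged KV caching implies. It is not vLLM's actual implementation: the names BLOCK_SIZE, BlockAllocator, and SequenceKVCache are hypothetical, and real engines track tensors in GPU memory rather than integer block IDs.

```python
BLOCK_SIZE = 16  # assumed tokens per KV block; vLLM uses a small fixed block size


class BlockAllocator:
    """Pool of fixed-size physical KV-cache blocks (GPU memory in a real engine)."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV-cache block pool exhausted")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class SequenceKVCache:
    """Per-sequence block table mapping logical blocks to physical block IDs.

    Blocks are allocated lazily, one at a time, so at most BLOCK_SIZE - 1
    token slots are ever wasted per sequence, instead of a whole
    max-length contiguous buffer.
    """

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> tuple[int, int]:
        """Reserve a KV slot for one new token; return (physical_block, offset)."""
        if self.num_tokens % BLOCK_SIZE == 0:
            # All existing blocks are full: grab a new physical block on demand.
            self.block_table.append(self.allocator.allocate())
        offset = self.num_tokens % BLOCK_SIZE
        self.num_tokens += 1
        return self.block_table[-1], offset

    def release(self) -> None:
        """Return all blocks to the pool when the sequence finishes."""
        for block_id in self.block_table:
            self.allocator.free(block_id)
        self.block_table.clear()


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=1024)
    seq = SequenceKVCache(allocator)
    for _ in range(40):  # simulate decoding 40 tokens
        seq.append_token()
    print(len(seq.block_table))  # 3 blocks used (ceil(40 / 16)), not one large slab
    seq.release()
```

In this sketch the attention kernel would gather keys and values through the block table, which is what lets the physical blocks sit anywhere in memory.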

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
