Concept

Trade-off between Memory Utilization and Access Overhead in PagedAttention

Storing data in non-contiguous memory blocks generally risks performance penalties, such as the extra indirection and reduced locality of per-block lookups. In PagedAttention, however, this overhead is minimal, because large-scale computations like attention are already partitioned into block-level (tiled) processing. By designing a paging strategy whose granularity aligns with this computational blocking, PagedAttention achieves significant gains in memory utilization with negligible impact from memory-access overhead.
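The idea can be sketched as follows. This is a minimal illustration, not vLLM's actual implementation: the class and function names, the block size of 4, and the pure-Python attention loop are all assumptions chosen for clarity. The key point it demonstrates is that a block table resolves logical KV-cache positions to non-contiguous physical blocks, and because attention iterates block by block anyway, the indirection costs one table lookup per block rather than per token.

```python
import math

BLOCK_SIZE = 4  # tokens per KV block (illustrative; vLLM's default differs)
DIM = 8         # head dimension (illustrative)

class PagedKVCache:
    """Sketch: logical KV positions map through a block table to
    non-contiguous physical blocks, as in PagedAttention."""
    def __init__(self):
        self.physical = {}      # physical block id -> list of (k, v) pairs
        self.block_table = []   # logical block index -> physical block id
        self.next_id = 0

    def append(self, k, v):
        # Allocate a new physical block only when the last one is full;
        # physical blocks need not be adjacent in memory.
        if not self.block_table or \
                len(self.physical[self.block_table[-1]]) == BLOCK_SIZE:
            self.physical[self.next_id] = []
            self.block_table.append(self.next_id)
            self.next_id += 1
        self.physical[self.block_table[-1]].append((k, v))

def attention(q, cache):
    # Attention is already computed block by block, so the indirection
    # through the block table adds one lookup per block, not per token.
    scores, values = [], []
    for block_id in cache.block_table:          # one lookup per block
        for k, v in cache.physical[block_id]:
            scores.append(sum(qi * ki for qi, ki in zip(q, k))
                          / math.sqrt(DIM))
            values.append(v)
    m = max(scores)                             # numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    out = [0.0] * DIM
    for w, v in zip(weights, values):
        for i in range(DIM):
            out[i] += w * v[i] / z
    return out
```

For example, appending 10 tokens with `BLOCK_SIZE = 4` allocates three physical blocks; the last block is only half full, which bounds internal fragmentation to at most one block per sequence.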

Updated 2026-05-06


Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences