Concept

Memory Bottleneck from KV Cache in LLMs

During autoregressive inference, the Key-Value (KV) cache stores the key and value vectors of every previously processed token, so its size grows linearly with sequence length. This is far cheaper than recomputing attention over the full prefix at every decoding step, whose cost grows quadratically, but for extremely long sequences the cache's memory footprint can exceed available accelerator memory and make deployment infeasible. This memory consumption is a primary bottleneck for applying standard Transformers to long-context problems.
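
To make the scale concrete, here is a minimal back-of-the-envelope sketch in Python of the cache's footprint. The model dimensions below are hypothetical, chosen to resemble a 7B-scale decoder with full multi-head attention and fp16 storage; they are illustrative assumptions, not figures from the text.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Return the KV cache size in bytes.

    The factor of 2 accounts for storing both keys and values;
    bytes_per_elem=2 corresponds to fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Hypothetical configuration: 32 layers, 32 KV heads, head_dim 128.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      seq_len=128_000, batch_size=1)
print(f"{size / 2**30:.1f} GiB")  # 62.5 GiB for one 128k-token sequence
```

Because each generated token appends a fixed-size slice (here 2 × 32 × 32 × 128 × 2 bytes = 512 KiB), the cache grows linearly with sequence length, exactly as described above, and at long context lengths it dominates the memory budget alongside the model weights themselves.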

Updated 2026-05-02

Tags: Ch.2 Generative Models - Foundations of Large Language Models; Ch.5 Inference - Foundations of Large Language Models; Foundations of Large Language Models; Foundations of Large Language Models Course; Computing Sciences