Short Answer

Drawbacks of Contiguous Memory Allocation for KV Caching

An inference engine for a large language model uses a standard self-attention mechanism where the key-value cache for each text sequence is stored in a single, contiguous block of memory. Explain the primary drawback of this memory allocation strategy, especially in a high-throughput environment where many sequences of varying lengths are processed concurrently.
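The core drawback the question is probing is fragmentation: reserving one contiguous region per sequence forces the engine to preallocate for the worst-case length, wasting memory on short sequences. A minimal sketch of that effect, with all lengths and block sizes chosen purely for illustration (they are assumptions, not values from the question), comparing worst-case contiguous reservation against on-demand block (paged) allocation:

```python
# Sketch: internal fragmentation from contiguous KV-cache allocation.
# All sizes below are illustrative assumptions, not from the question.

max_len = 2048            # contiguous strategy reserves the worst-case length
block = 16                # page size (in tokens) for a paged allocator

seq_lens = [37, 512, 1900, 128, 5]   # actual lengths of concurrent sequences

# Contiguous strategy: each sequence reserves max_len slots up front,
# so short sequences strand most of their reservation.
contig_reserved = max_len * len(seq_lens)
used = sum(seq_lens)
contig_waste = contig_reserved - used

# Paged strategy: allocate fixed-size blocks on demand; waste is only the
# unused tail of each sequence's final block (always < block per sequence).
paged_reserved = sum(-(-n // block) * block for n in seq_lens)
paged_waste = paged_reserved - used

print(f"tokens actually used:      {used}")
print(f"contiguous reserved/waste: {contig_reserved} / {contig_waste}")
print(f"paged reserved/waste:      {paged_reserved} / {paged_waste}")
```

With these example lengths, the contiguous scheme strands roughly three quarters of its reservation, while per-block waste stays bounded by the block size; this is the memory pressure that limits batch size, and hence throughput, under concurrent variable-length requests.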

Updated 2025-10-06


Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science