An inference server running a large language model processes thousands of text generation requests, each with a different sequence length. The server allocates memory for the key and value vectors of each sequence as a single, contiguous block. After some time, the server begins to fail when trying to allocate memory for new requests, despite system monitoring tools showing that a significant total amount of memory is still free. Which statement best analyzes the most likely reason for these allocation failures?
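The failure mode in the question is external fragmentation: after many variable-length sequences are allocated and freed as contiguous blocks, plenty of memory remains free in total, but no single gap is large enough for a new request. A minimal sketch (a toy free-list simulation, not any real server's allocator; all names here are illustrative) shows the effect:

```python
# Toy simulation of external fragmentation when each sequence's KV cache
# must occupy one contiguous block of memory.
POOL = 1000  # total memory units in the pool

# Allocate ten 100-unit KV blocks back to back, then free every other one
# (as if alternating sequences finished generating and were released).
allocated = [(i * 100, 100) for i in range(10)]               # (offset, size)
allocated = [blk for i, blk in enumerate(allocated) if i % 2 == 0]

def free_regions(allocated, pool):
    """Return the gaps between allocated blocks as (offset, size) pairs."""
    regions, cursor = [], 0
    for off, size in sorted(allocated):
        if off > cursor:
            regions.append((cursor, off - cursor))
        cursor = off + size
    if cursor < pool:
        regions.append((cursor, pool - cursor))
    return regions

gaps = free_regions(allocated, POOL)
total_free = sum(size for _, size in gaps)
largest_gap = max(size for _, size in gaps)

print(total_free)   # 500 units are free in total...
print(largest_gap)  # ...but no single contiguous gap exceeds 100 units
```

Here a new sequence needing, say, 150 contiguous units cannot be placed even though half the pool is free, which is exactly what monitoring tools report as "memory available" while allocations fail. Paged approaches (e.g. vLLM's PagedAttention) avoid this by storing the KV cache in small fixed-size blocks that need not be contiguous.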
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Memory Fragmentation in LLM Inference
Comparison of Memory Allocation in Standard vs. Paged Attention
Diagnosing Inference Server Failures
Drawbacks of Contiguous Memory Allocation for KV Caching