KV Cache Allocation in a Fragmented Memory Scenario
An LLM inference system needs to allocate key-value (KV) cache memory for a new sequence that requires 4 blocks. The system's physical memory has 5 free blocks in total, but they are not contiguous: [Used, Free, Used, Used, Free, Free, Used, Free, Free]. Based on the principle of partitioning the cache into fixed-size blocks that can be stored in non-contiguous locations, explain whether the system can fulfill this request and describe the key benefit of this memory allocation strategy.
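For concreteness, here is a minimal Python sketch of this block-based strategy, in the spirit of PagedAttention; the `BlockAllocator` name and its interface are illustrative assumptions, not a specific library's API.

```python
# Minimal sketch of block-level KV cache allocation (PagedAttention-style).
# BlockAllocator and its methods are illustrative, not from a real library.
from typing import List, Optional


class BlockAllocator:
    """Tracks which fixed-size physical blocks are free."""

    def __init__(self, layout: List[str]) -> None:
        # True marks a free physical block in the given layout.
        self.free = [slot == "Free" for slot in layout]

    def allocate(self, num_blocks: int) -> Optional[List[int]]:
        """Return indices of num_blocks free blocks, contiguous or not."""
        candidates = [i for i, is_free in enumerate(self.free) if is_free]
        if len(candidates) < num_blocks:
            return None  # fails only when *total* free blocks are insufficient
        chosen = candidates[:num_blocks]
        for i in chosen:
            self.free[i] = False
        return chosen


allocator = BlockAllocator(
    ["Used", "Free", "Used", "Used", "Free", "Free", "Used", "Free", "Free"]
)
# The request needs 4 blocks; 5 scattered free blocks exist, so it succeeds.
block_table = allocator.allocate(4)
print(block_table)  # [1, 4, 5, 7] -- the sequence's block table
```

Because any free physical block can back any logical block of the sequence, the request for 4 blocks succeeds against 5 scattered free blocks; a contiguous allocator would fail here, since the longest run of adjacent free blocks is only 2.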
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Trade-off between Memory Utilization and Access Overhead in PagedAttention
An LLM inference server manages its key-value cache by allocating a single, contiguous block of memory for each user request. The server often rejects new, long requests, citing insufficient memory, even when the total amount of free memory is much larger than the requested amount. This issue is particularly common after many shorter requests have been processed and their memory has been freed. Which of the following best explains this problem and how partitioning the cache into smaller, fixed-size blocks that can be stored in non-contiguous locations would resolve it? (A sketch contrasting the two strategies follows this list.)
Memory Allocation Strategy Analysis
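As a hedged illustration of the fragmentation failure described in the related question above, the sketch below applies a contiguous allocator and a block-based (paged) allocator to the same fragmented layout; `allocate_contiguous` and `allocate_paged` are hypothetical helper names.

```python
# Illustrative comparison on the fragmented layout from the scenario above.
# Both helpers are hypothetical sketches, not a real library's API.
from typing import List, Optional

layout = ["Used", "Free", "Used", "Used", "Free", "Free", "Used", "Free", "Free"]
free = [slot == "Free" for slot in layout]


def allocate_contiguous(free: List[bool], n: int) -> Optional[int]:
    """Return the start index of a run of n adjacent free blocks, or None."""
    run = 0
    for i, is_free in enumerate(free):
        run = run + 1 if is_free else 0
        if run == n:
            return i - n + 1
    return None


def allocate_paged(free: List[bool], n: int) -> Optional[List[int]]:
    """Gather any n free blocks, regardless of adjacency."""
    indices = [i for i, f in enumerate(free) if f]
    return indices[:n] if len(indices) >= n else None


# The contiguous allocator rejects the request: the longest free run is 2.
print(allocate_contiguous(free, 4))  # None
# The paged allocator fulfills it from scattered blocks.
print(allocate_paged(free, 4))       # [1, 4, 5, 7]
```

This is exactly the external-fragmentation problem in the related question: total free memory (5 blocks) exceeds the request (4 blocks), yet no contiguous run is large enough, and block-level allocation resolves it.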