Multiple Choice

An LLM inference server manages its key-value (KV) cache by allocating a single contiguous block of memory for each user request. The server often rejects new long requests, citing insufficient memory, even when the total amount of free memory is much larger than the requested amount. This issue is particularly common after many shorter requests have completed and their memory has been freed. Which of the following best explains this problem, and how would partitioning the cache into smaller, fixed-size blocks that can be stored in non-contiguous locations resolve it?
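The scenario in the stem is the classic external-fragmentation problem for contiguously allocated KV caches, with block-based (paged) allocation as the proposed remedy. The minimal Python sketch below contrasts the two strategies; the class names (ContiguousPool, PagedPool), pool size, and block size are illustrative assumptions, not the implementation of any particular serving framework.

```python
# Sketch only: contrasts contiguous KV-cache allocation with fixed-size block
# (paged) allocation. All names and sizes here are hypothetical.

class ContiguousPool:
    """Allocates one contiguous slot range per request (prone to external fragmentation)."""
    def __init__(self, total_slots):
        self.free = [(0, total_slots)]  # list of (start, length) free runs

    def allocate(self, n_slots):
        for i, (start, length) in enumerate(self.free):
            if length >= n_slots:
                # carve the request out of this free run
                if length == n_slots:
                    self.free.pop(i)
                else:
                    self.free[i] = (start + n_slots, length - n_slots)
                return (start, n_slots)
        return None  # no single free run is large enough, even if total free space is

    def release(self, region):
        start, n = region
        self.free.append((start, n))  # no coalescing, to keep the sketch short


class PagedPool:
    """Allocates fixed-size blocks that need not be contiguous (no external fragmentation)."""
    def __init__(self, total_slots, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(total_slots // block_size))

    def allocate(self, n_slots):
        n_blocks = -(-n_slots // self.block_size)  # ceiling division
        if n_blocks > len(self.free_blocks):
            return None
        return [self.free_blocks.pop() for _ in range(n_blocks)]

    def release(self, blocks):
        self.free_blocks.extend(blocks)


if __name__ == "__main__":
    # Many short requests come and go, leaving free space scattered.
    contiguous = ContiguousPool(total_slots=64)
    short = [contiguous.allocate(8) for _ in range(8)]   # fill the pool with 8-slot requests
    for region in short[::2]:                            # free every other one: 32 slots free total
        contiguous.release(region)
    print("contiguous, 20-slot request:", contiguous.allocate(20))  # None: no 20-slot run exists

    paged = PagedPool(total_slots=64, block_size=4)
    short = [paged.allocate(8) for _ in range(8)]
    for blocks in short[::2]:
        paged.release(blocks)
    print("paged, 20-slot request:", paged.allocate(20))  # succeeds: 5 scattered blocks suffice
```

In the contiguous case the 20-slot request fails because no single free run is that large, even though 32 slots are free in total; in the paged case the same request succeeds by gathering five scattered 4-slot blocks.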

Updated 2025-10-01

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy
