Multiple Choice

An LLM inference server manages its key-value (KV) cache by allocating a single contiguous block of memory for each user request. The server often rejects new long requests, citing insufficient memory, even when the total amount of free memory is much larger than the requested amount. This issue is particularly common after many shorter requests have completed and their memory has been freed. Which of the following best explains this problem, and how would partitioning the cache into smaller, fixed-size blocks that can be stored in non-contiguous locations resolve it?
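The scenario in the stem is the classic external-fragmentation problem for contiguously allocated KV caches, with block-based (paged) allocation as the proposed remedy. The minimal Python sketch below contrasts the two strategies; the class names (ContiguousPool, PagedPool), pool size, and block size are illustrative assumptions, not the implementation of any particular serving framework.

```python
# Sketch only: contrasts contiguous KV-cache allocation with fixed-size block
# (paged) allocation. All names and sizes here are hypothetical.

class ContiguousPool:
    """Allocates one contiguous slot range per request (prone to external fragmentation)."""
    def __init__(self, total_slots):
        self.free = [(0, total_slots)]  # list of (start, length) free runs

    def allocate(self, n_slots):
        for i, (start, length) in enumerate(self.free):
            if length >= n_slots:
                # carve the request out of this free run
                if length == n_slots:
                    self.free.pop(i)
                else:
                    self.free[i] = (start + n_slots, length - n_slots)
                return (start, n_slots)
        return None  # no single free run is large enough, even if total free space is

    def release(self, region):
        start, n = region
        self.free.append((start, n))  # no coalescing, to keep the sketch short


class PagedPool:
    """Allocates fixed-size blocks that need not be contiguous (no external fragmentation)."""
    def __init__(self, total_slots, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(total_slots // block_size))

    def allocate(self, n_slots):
        n_blocks = -(-n_slots // self.block_size)  # ceiling division
        if n_blocks > len(self.free_blocks):
            return None
        return [self.free_blocks.pop() for _ in range(n_blocks)]

    def release(self, blocks):
        self.free_blocks.extend(blocks)


if __name__ == "__main__":
    # Many short requests come and go, leaving free space scattered.
    contiguous = ContiguousPool(total_slots=64)
    short = [contiguous.allocate(8) for _ in range(8)]   # fill the pool with 8-slot requests
    for region in short[::2]:                            # free every other one: 32 slots free total
        contiguous.release(region)
    print("contiguous, 20-slot request:", contiguous.allocate(20))  # None: no 20-slot run exists

    paged = PagedPool(total_slots=64, block_size=4)
    short = [paged.allocate(8) for _ in range(8)]
    for blocks in short[::2]:
        paged.release(blocks)
    print("paged, 20-slot request:", paged.allocate(20))  # succeeds: 5 scattered blocks suffice
```

In the contiguous case the 20-slot request fails because no single free run is that large, even though 32 slots are free in total; in the paged case the same request succeeds by gathering five scattered 4-slot blocks.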

Updated 2025-10-01

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy
