Essay

Choosing a KV-cache strategy for shared-prefix traffic under GPU memory pressure

You operate an internal LLM inference service for a company knowledge assistant. Traffic has two dominant patterns: (1) many users start chats with the same 300-token “policy + safety + tool instructions” system prompt, then ask different questions; (2) a smaller set of power users submit long, unique prompts (2,000–4,000 tokens). The server uses continuous batching and must keep p95 latency low. Recently, after the server has been running for hours, you observe that GPU memory monitoring still shows ~25% free memory, yet new long requests intermittently fail to start or trigger sharp throughput drops.
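
As a rough sanity check on the numbers above, the sketch below estimates per-token KV-cache size and compares the memory held by the shared 300-token prefix against a single 4,000-token power-user prompt. The model dimensions (32 layers, 8 KV heads, head dimension 128, fp16 cache) are illustrative assumptions, not a description of the deployed model.

```python
# Hypothetical model dimensions (assumptions for illustration only).
LAYERS = 32          # transformer layers
KV_HEADS = 8         # KV heads (e.g., grouped-query attention)
HEAD_DIM = 128       # dimension per head
BYTES_PER_VALUE = 2  # fp16 cache

# Each token stores one K and one V vector per layer.
bytes_per_token = LAYERS * KV_HEADS * HEAD_DIM * 2 * BYTES_PER_VALUE

shared_prefix_tokens = 300
long_prompt_tokens = 4000

print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")
print(f"300-token shared prefix: {shared_prefix_tokens * bytes_per_token / 2**20:.1f} MiB (paid once if cached)")
print(f"4000-token unique prompt: {long_prompt_tokens * bytes_per_token / 2**20:.1f} MiB (paid per request)")
```

Under these assumed dimensions the per-token cost is the same in prefill and decode; what differs is that prefill demands thousands of entries at admission time, while decode adds one entry per generated token per step.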

Write an evaluation recommending a concrete inference-time caching and memory-management approach that addresses both compute and memory issues. Your answer must explain, in one coherent argument, how (a) KV-cache growth differs between initial prompt processing (prefill) and token-by-token generation (decode), (b) prefix caching changes the amount of prefill work for shared-prefix requests and what it costs in memory, and (c) memory fragmentation can cause “free memory but allocation failure,” including how paged KV caching (PagedAttention) would change allocation behavior. Conclude with a justified recommendation (e.g., enable/disable prefix caching, use paged KV caching, and any constraints such as eviction policy or what to cache) and explicitly discuss the tradeoffs you are accepting.
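
A minimal sketch of the allocation behavior the essay should reason about, assuming a hypothetical block pool rather than any particular engine's API: fixed-size KV blocks are handed out from a free list, so no request ever needs one large contiguous region, and requests that share the system-prompt prefix point their first blocks at the same physical blocks via reference counting, in the spirit of PagedAttention-style prefix caching. Block size, class names, and methods are invented for illustration.

```python
from dataclasses import dataclass, field

BLOCK_TOKENS = 16  # tokens per KV block (assumed; real engines use similar small block sizes)


@dataclass
class BlockPool:
    """Fixed-size KV blocks handed out from a free list.

    Because every allocation is some number of identical blocks, no request
    needs a contiguous slab, so external fragmentation cannot produce
    "free memory but allocation failure".
    """
    num_blocks: int
    free: list[int] = field(default_factory=list)
    refcount: dict[int, int] = field(default_factory=dict)

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def allocate(self, n_tokens: int) -> list[int]:
        """Grab enough whole blocks to hold n_tokens worth of KV entries."""
        n_blocks = -(-n_tokens // BLOCK_TOKENS)  # ceiling division
        if n_blocks > len(self.free):
            raise MemoryError("not enough free KV blocks")
        blocks = [self.free.pop() for _ in range(n_blocks)]
        for b in blocks:
            self.refcount[b] = 1
        return blocks

    def share(self, blocks: list[int]) -> list[int]:
        """Reuse already-filled blocks (e.g., the cached system prompt) by bumping refcounts."""
        for b in blocks:
            self.refcount[b] += 1
        return list(blocks)

    def release(self, blocks: list[int]) -> None:
        """Return blocks to the free list once no request references them."""
        for b in blocks:
            self.refcount[b] -= 1
            if self.refcount[b] == 0:
                del self.refcount[b]
                self.free.append(b)


pool = BlockPool(num_blocks=1024)
prefix = pool.allocate(300)                       # shared system prompt: prefilled once, kept resident
chat_a = pool.share(prefix) + pool.allocate(40)   # new chat reuses the prefix, prefills only its question
chat_b = pool.share(prefix) + pool.allocate(60)   # another chat, same reuse
power  = pool.allocate(4000)                      # long unique prompt still pays full prefill and memory
```

In this scheme, admitting a 4,000-token prompt only needs enough free blocks in total rather than one contiguous region, which is the failure mode a contiguous per-request allocator can hit after hours of requests finishing and starting at different lengths.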

Updated 2026-02-06

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Ch.5 Inference - Foundations of Large Language Models
