Case Study

Stabilizing latency and GPU memory in a chat-completions service with shared system prompts

You are on-call for an internal LLM chat-completions service used by multiple product teams. Traffic has two dominant patterns: (1) many requests share an identical 250-token system prompt (policy + formatting) but have different user messages; (2) a smaller set of power users send very long, unique prompts (2,000–6,000 tokens). The service uses continuous batching and standard contiguous KV-cache allocation per sequence.
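For concreteness, the sketch below shows what contiguous per-sequence KV-cache allocation means here; the class and field names are illustrative assumptions, not the service's actual code.

    from dataclasses import dataclass

    @dataclass
    class ContiguousKVAllocator:
        # Capacity is measured in KV token slots; free_extents holds
        # (offset, length) runs of currently unused slots.
        total_tokens: int
        free_extents: list

        def allocate(self, num_tokens: int) -> int:
            # Each sequence must receive ONE unbroken run covering its whole
            # prompt plus expected output, so the allocator searches for a
            # single free extent at least num_tokens long.
            for i, (offset, length) in enumerate(self.free_extents):
                if length >= num_tokens:
                    self.free_extents[i] = (offset + num_tokens, length - num_tokens)
                    return offset
            # Total free memory can exceed num_tokens while no single extent
            # is large enough -- the allocation still fails.
            raise MemoryError("no contiguous KV region large enough")

Under this scheme, an allocator whose free space is scattered across many small extents can reject a single long request even though the sum of free slots is more than sufficient.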

Symptoms over a 2-hour window:

  • P50 time-to-first-token (TTFT) is good, but P99 TTFT spikes when long prompts arrive.
  • During spikes, GPU monitoring shows ~25–35% of total memory free, yet new long requests sometimes fail to start with an out-of-memory/allocation error (a back-of-the-envelope KV sizing sketch follows this list).
  • When failures happen, short requests still decode, but overall throughput drops.
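As a rough illustration of how that second symptom can arise, here is a back-of-the-envelope KV-cache sizing calculation; the model dimensions are assumptions chosen only to make the arithmetic concrete, not the service's actual configuration.

    num_layers   = 32     # assumed
    num_kv_heads = 8      # assumed (grouped-query attention)
    head_dim     = 128    # assumed
    bytes_per_el = 2      # fp16/bf16

    # K and V per token, across all layers:
    kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el

    for prompt_tokens in (250, 2_000, 6_000):
        mib = prompt_tokens * kv_bytes_per_token / 2**20
        print(f"{prompt_tokens:>5} tokens -> ~{mib:.0f} MiB of KV cache")

    # Under these assumed dimensions: 250 tokens -> ~31 MiB,
    # 2,000 -> ~250 MiB, 6,000 -> ~750 MiB. With contiguous per-sequence
    # allocation, a long prompt needs one unbroken region of that size;
    # 25-35% of memory being free in total does not guarantee that any
    # single free region is that large.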

You are allowed to change only inference-time memory/caching strategy (no model changes). Propose a concrete design that addresses BOTH (a) the TTFT spikes and (b) the allocation failures, using KV-cache behavior across prefilling vs decoding, prefix caching, and a fragmentation-aware KV memory scheme. In your answer, explain the causal chain from the current design to the observed symptoms, and justify the tradeoffs your design makes (e.g., memory overhead vs compute saved, and any impact on decoding performance).
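For reference, the two mechanisms the prompt names (prefix caching and a fragmentation-aware, block-based KV layout in the spirit of PagedAttention) might be sketched roughly as below; the class name, block size, and hashing scheme are illustrative assumptions, not a prescribed answer.

    import hashlib

    BLOCK_TOKENS = 16  # tokens per KV block (illustrative size)

    class BlockKVCache:
        """Fixed-size KV blocks plus hash-based reuse of shared-prefix blocks."""

        def __init__(self, num_blocks: int):
            self.free_blocks = list(range(num_blocks))  # any free block fits any request
            self.prefix_index = {}  # prefix hash -> block id (full blocks only)
            self.refcount = {}      # block id -> number of sequences using it

        def allocate_prompt(self, tokens: list) -> list:
            # Returns a block table (list of block ids) for the prompt. Blocks
            # whose full token prefix has been seen before are reused, so the
            # KV for an identical shared system prompt is stored (and
            # prefilled) only once.
            table = []
            for start in range(0, len(tokens), BLOCK_TOKENS):
                chunk = tokens[start:start + BLOCK_TOKENS]
                key = hashlib.sha256(repr(tokens[:start + len(chunk)]).encode()).hexdigest()
                block = self.prefix_index.get(key) if len(chunk) == BLOCK_TOKENS else None
                if block is None:
                    block = self.free_blocks.pop()      # IndexError here = out of blocks
                    if len(chunk) == BLOCK_TOKENS:
                        self.prefix_index[key] = block  # only full blocks are shareable
                    self.refcount[block] = 0
                self.refcount[block] += 1
                table.append(block)
            return table

In a design like this, a new request needs only enough free fixed-size blocks rather than one large contiguous run, and sequences that begin with the identical system prompt can point their block tables at the same already-prefilled blocks.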
