Root-cause and mitigation plan for OOMs and latency spikes during shared-prefix, long-generation traffic
You are the on-call engineer for an internal LLM gateway that serves two high-volume products on the same GPU pool: (A) a customer-support chat agent and (B) a report generator. Both products use the same 220-token system prompt, but user prompts range from 20 to 2,000 tokens. Typical outputs are 50 tokens for (A) and 1,500 tokens for (B). The serving stack uses continuous batching and stores each request’s KV cache in a single contiguous allocation that grows as decoding proceeds.
Over the last week, you observe two symptoms that often occur together during peak hours:
- New long requests fail to start with an out-of-memory error even when monitoring shows ~25% of GPU memory is free.
- P99 token latency during streaming generation increases steadily over time, especially when many long outputs are in flight.
A teammate proposes a quick fix: “Enable prefix caching for the shared system prompt; that will reduce compute and should also fix the memory issues.” Another teammate proposes: “Switch KV cache allocation to a paged/block-based scheme (PagedAttention-style) to eliminate fragmentation; prefix caching is optional.”
As the incident lead, choose which proposal you would implement first (prefix caching or paged KV caching), and justify your decision by explicitly connecting: (i) what happens in prefilling vs. decoding, (ii) how the KV cache grows and is reused across decoding steps, (iii) why the system can OOM despite free memory (fragmentation), and (iv) how your chosen change affects both memory behavior and end-to-end latency for these two products. Your answer should also name one concrete risk or tradeoff introduced by your chosen change.
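For concreteness when weighing the two proposals, the following back-of-the-envelope sketch estimates how each request’s KV cache grows during decoding and why product (B)’s long outputs dominate memory. The model dimensions (a 7B-class model in fp16) are assumptions made for illustration only; they are not part of the incident data.

```python
# Hedged sketch: per-request KV-cache growth, assuming a hypothetical
# 7B-class model (32 layers, 32 heads, head_dim 128, fp16). These model
# dimensions are NOT given in the incident; they only illustrate scale.

def kv_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, dtype_bytes=2):
    # Both keys and values are cached, hence the factor of 2.
    return 2 * n_layers * n_heads * head_dim * dtype_bytes * seq_len

prompt = 220 + 1000           # shared system prompt + a mid-sized user prompt
chat_total = prompt + 50      # product (A): short outputs
report_total = prompt + 1500  # product (B): long outputs

print(f"(A) chat request KV cache:   {kv_bytes(chat_total) / 2**20:.0f} MiB")
print(f"(B) report request KV cache: {kv_bytes(report_total) / 2**20:.0f} MiB")

# The cache grows by kv_bytes(1) (~0.5 MiB here) on every decoding step, so a
# contiguous allocation sized for the final length, or regrown as it fills,
# leaves variable-sized holes when requests of mixed lengths finish.
```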

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.5 Inference - Foundations of Large Language Models
Related
Space Complexity of the KV Cache
Updating the KV Cache
Two-Phase Inference from a KV Cache Perspective
Single-Step Generation with a KV Cache
Memory Allocation for KV Caching in Standard Self-Attention
Multi-Dimensional Structure of the KV Cache
An autoregressive language model generates text one word at a time. To generate the 100th word, it must relate it to all 99 previous words. A common optimization involves storing in memory the intermediate representations for each of the first 99 words as they are generated.
Which statement best analyzes the primary computational advantage of this optimization compared to re-computing everything from scratch at step 100?
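A minimal sketch of the cached decoding step this question describes is shown below; the single-head attention, the shapes, and the helper names are assumptions made for illustration, not any particular library’s API.

```python
import numpy as np

# Illustrative single-layer, single-head attention step with a KV cache.
d = 64
k_cache, v_cache = [], []   # keys/values kept from all earlier steps

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(x_new, W_q, W_k, W_v):
    q = x_new @ W_q                          # query for the current token only
    k_cache.append(x_new @ W_k)              # append one new key ...
    v_cache.append(x_new @ W_v)              # ... and one new value
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = softmax(q @ K.T / np.sqrt(d))   # attend over all cached tokens
    return scores @ V                        # old K/V are reused, not recomputed

rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
for _ in range(3):                           # a few decode steps
    out = decode_step(rng.standard_normal(d), W_q, W_k, W_v)

# Without the cache, every step would recompute K and V for all previous
# tokens, turning each step into O(n) projection work on top of attention.
```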
Chatbot Performance Degradation
Computational Steps in Cached Inference
Diagnosing and Redesigning KV-Cache Memory Behavior in a Multi-Tenant LLM Serving Stack
Choosing a KV-cache strategy for shared-prefix traffic under GPU memory pressure
Evaluating a serving design that combines prefix caching with paged KV memory under mixed prompt lengths
Stabilizing latency and GPU memory in a chat-completions service with shared system prompts
Post-incident analysis: KV-cache growth, fragmentation, and shared-prefix reuse in a streaming LLM service
Root-cause and mitigation plan for OOMs and latency spikes during shared-prefix, long-generation traffic
Formula for KV Cache Prefilling
Prefix Caching for LLM Inference
Prefilling as an Encoding Process
Disaggregation of Prefilling and Decoding using Pipelined Engines
Prefilling in One Go (Standard Prefilling)
A large language model is given a 1000-token document to process before it begins generating a new, multi-token response. Which statement best analyzes the fundamental computational difference between how the model processes the initial 1000-token document versus how it will subsequently generate each new token for its response?
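One way to see the difference this question asks about is the rough sketch below, which contrasts a single parallel prefill pass with the sequential decode loop; all dimensions and matrices are assumptions made for illustration.

```python
import numpy as np

# Illustrative contrast between prefilling and decoding for one layer/head.
d, prompt_len = 64, 1000
rng = np.random.default_rng(0)
W_k, W_v = rng.standard_normal((d, d)), rng.standard_normal((d, d))
prompt = rng.standard_normal((prompt_len, d))

# Prefilling: every prompt token is already known, so keys and values for all
# 1000 positions come from one parallel pass (one large matmul per projection).
K_cache = prompt @ W_k            # (1000, d) in a single shot
V_cache = prompt @ W_v

# Decoding: each new token depends on the previous one, so the model is forced
# into a sequential loop that adds exactly one row to the cache per step.
for _ in range(5):                                 # a few generated tokens
    x_new = rng.standard_normal(d)                 # stand-in for the latest token
    K_cache = np.vstack([K_cache, (x_new @ W_k)[None, :]])
    V_cache = np.vstack([V_cache, (x_new @ W_v)[None, :]])

# Prefilling is parallel and compute-bound; decoding is sequential and, as the
# cache grows, increasingly bound by reading that cache at every step.
```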
LLM Inference Performance Analysis
Parallel Self-Attention in the Prefilling Phase
The Role and Output of the Prefilling Phase
Decoding Network for KV Cache Generation
Diagram of the Decoding Phase
Single-Step Generation with a KV Cache
Comparison of Prefilling and Decoding Phases
Disaggregation of Prefilling and Decoding using Pipelined Engines
After a large language model processes an initial prompt, it enters a generation stage where it produces the output sequence one token at a time. In each step of this stage, the model generates a new query vector for the current position and performs an attention operation over the key-value pairs of the initial prompt plus the key-value pairs of all tokens generated in previous steps. As the output sequence gets longer, what becomes the most significant performance bottleneck for generating each new token?
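A quick, hedged piece of arithmetic (assuming 7B-class model dimensions in fp16, which are not given in the question) shows why reading the growing KV cache, rather than the per-token compute, tends to dominate each decode step.

```python
# Rough arithmetic showing why per-token latency drifts upward as outputs get
# longer: each decode step must read the entire KV cache from GPU memory, and
# that read grows linearly with the current context length.
bytes_per_token = 2 * 32 * 32 * 128 * 2   # K+V, layers, heads, head_dim, fp16

for ctx in (500, 1500, 3000):
    read_gb = ctx * bytes_per_token / 1e9
    print(f"context {ctx:>4} tokens -> ~{read_gb:.2f} GB of KV cache read per new token")

# With many long-output requests in flight, these memory reads (not the matmul
# FLOPs for a single new token) dominate step time.
```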
A large language model has finished processing an initial prompt and is about to generate the first token of its response. Arrange the following events in the correct chronological order for this single generation step.
Evaluating an Inference Optimization Proposal
Decoding Phase Goal Formula
Non-Contiguous Memory Allocation in PagedAttention
Flexible Memory Management with PagedAttention
Applicability of PagedAttention to Batched Inference
Comparison of Memory Allocation in Standard vs. Paged Attention
Improved Memory Utilization with PagedAttention
Parallelization of KV Caching in PagedAttention
An LLM inference server is handling multiple, concurrent text generation requests with varying sequence lengths. System monitoring reveals that although 30% of the total GPU memory is free, the server often fails when trying to start a new request that requires a large key-value (KV) cache. The allocation failure occurs because no single, continuous block of free memory is large enough. Which of the following best diagnoses the problem and proposes an effective solution?
Comparative Analysis of KV Cache Memory Allocation
Match each memory management term with its correct description in the context of large language model inference.
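To make the fragmentation diagnosis in the scenario above concrete, here is a minimal, hypothetical sketch of a PagedAttention-style block allocator; the class and field names are invented for illustration and do not correspond to any specific library.

```python
# Minimal sketch of a block-table KV allocator: KV entries live in fixed-size
# blocks, so a request only ever needs "one more free block", never one large
# contiguous region.
class PagedKVAllocator:
    def __init__(self, num_blocks, block_tokens=16):
        self.block_tokens = block_tokens
        self.free = list(range(num_blocks))    # pool of fixed-size physical blocks
        self.tables = {}                       # request id -> (block ids, tokens used)

    def append_token(self, req_id):
        blocks, used = self.tables.get(req_id, ([], 0))
        if used == len(blocks) * self.block_tokens:        # last block is full
            if not self.free:
                raise MemoryError("out of KV blocks")      # true exhaustion, not fragmentation
            blocks.append(self.free.pop())                 # any free block will do
        self.tables[req_id] = (blocks, used + 1)

    def release(self, req_id):
        blocks, _ = self.tables.pop(req_id)
        self.free.extend(blocks)               # freed blocks are reusable as-is

# Because every allocation is one fixed-size block, holes left by finished
# requests can always satisfy new requests; "25% free but cannot allocate"
# disappears by construction, and internal waste is at most one partially
# filled block per request.
```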
Example of Padded Sequences in Fragmented Memory
PagedAttention for KV Cache Memory Optimization
An LLM serving system is processing numerous concurrent requests of varying lengths. As requests are completed, their associated memory is freed. After running for some time, the system's overall throughput decreases, and it frequently fails to start processing new, long sequences, even though monitoring tools show that a significant percentage of total memory is free. Based on this scenario, what is the most accurate evaluation of the underlying problem?
LLM Memory Allocation Failure Analysis
The Paradox of Free Memory in LLM Serving
Process of Generating Prefix Caches
Process of Utilizing a Prefix Cache
Implementing Prefix Caching with a Key-Value Datastore
Memory Management Challenges in Prefix Caching
Cache Eviction Policies for Prefix Caching
An LLM inference system is designed to optimize performance by storing the intermediate hidden states generated from the initial tokens of user prompts. The system has just finished processing the request: 'Analyze the market trends for electric vehicles in North America.' Immediately after, it receives a new request: 'Analyze the market trends for electric vehicles in Europe.' How will the system leverage its optimization technique to process this second request?
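A rough sketch of the prefix-cache lookup this question alludes to is given below; the `compute_kv` callback, the hash-keyed store, and the per-token list representation of K/V state are all assumptions of the sketch (real systems typically match exact, block-aligned token prefixes).

```python
# Hedged sketch of a prefix cache keyed by a hash of the token prefix.
# `compute_kv` stands in for the model's prefill pass; K/V state is a plain
# per-token list here purely for clarity.
kv_store = {}   # hash(token prefix) -> cached K/V entries for that prefix

def prefill_with_prefix_cache(tokens, compute_kv):
    # Look for the longest prefix whose K/V entries an earlier request stored.
    for cut in range(len(tokens), 0, -1):
        hit = kv_store.get(hash(tuple(tokens[:cut])))
        if hit is not None:
            tail_kv = compute_kv(tokens[cut:], past=hit)    # prefill only the new tail
            full_kv = hit + tail_kv
            break
    else:
        full_kv = compute_kv(tokens, past=None)             # cold start: full prefill
    kv_store[hash(tuple(tokens))] = full_kv                 # future requests can reuse this
    return full_kv

# For the two "Analyze the market trends for electric vehicles in ..." prompts,
# every token up to the differing region hits the cache, so the second request
# only prefills the short tail before decoding begins.
```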
Evaluating Caching Strategy Effectiveness
Choosing an Optimal Caching Strategy