Based on the operational principles of autoregressive generation, evaluate the fundamental flaw in the engineer's proposed design described in the case study below. Explain why this approach is not feasible.


Following the prefilling stage, the decoding phase utilizes the pre-computed key-value pairs stored in the KV cache to autoregressively generate subsequent tokens one by one.

Decoding Phase in Transformer Inference

The decoding phase of a Transformer, as illustrated in its diagram, operates sequentially to generate one token at a time. In this step-by-step process, the model uses the token from the previous step as input to an embedding layer, which then generates a new query vector. This query attends to an expanding set of keys and values, comprising those from the initial prompt (prefilling phase) and all previously generated tokens. The output from this self-attention mechanism is processed by a Softmax layer to calculate the conditional probability for the next token, such as `Pr(y_n | x, y_{<n})`. This autoregressive cycle is repeated for each new token in the output sequence.
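The cycle described above can be sketched in a few lines of plain Python. This is a toy illustration, not a real model: `ToyDecoder`, its random weights, the single attention head, and the greedy token choice are all assumptions made for the example, and the prompt is fed token by token for simplicity (real prefilling processes it in parallel).

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix: List[Vector], vec: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows: int, cols: int, rng: random.Random) -> List[Vector]:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class ToyDecoder:
    """Hypothetical single-layer, single-head decoder with random weights."""

    def __init__(self, vocab_size: int, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.dim = dim
        self.embed = random_matrix(vocab_size, dim, rng)
        self.w_q = random_matrix(dim, dim, rng)
        self.w_k = random_matrix(dim, dim, rng)
        self.w_v = random_matrix(dim, dim, rng)
        self.w_out = random_matrix(vocab_size, dim, rng)
        self.keys: List[Vector] = []    # KV cache: one key per processed token
        self.values: List[Vector] = []  # KV cache: one value per processed token

    def step(self, token: int) -> Vector:
        """Feed one token, grow the cache by one entry, return next-token probabilities."""
        x = self.embed[token]
        q = matvec(self.w_q, x)
        self.keys.append(matvec(self.w_k, x))
        self.values.append(matvec(self.w_v, x))
        scale = math.sqrt(self.dim)
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in self.keys]
        weights = softmax(scores)
        context = [sum(w * v[j] for w, v in zip(weights, self.values))
                   for j in range(self.dim)]
        return softmax(matvec(self.w_out, context))


def generate(model: ToyDecoder, prompt: List[int], max_new_tokens: int) -> List[int]:
    probs: Vector = []
    for token in prompt:  # stand-in for prefilling (requires a non-empty prompt)
        probs = model.step(token)
    output: List[int] = []
    for _ in range(max_new_tokens):
        next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy choice
        output.append(next_token)
        probs = model.step(next_token)  # the new token becomes the next input
    return output
```

Each call to `step` issues exactly one query but reads every cached key and value, so the work per generated token grows with the length of the context.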

Diagram of the Decoding Phase

The prefilling and decoding phases of Large Language Model inference differ significantly across several dimensions. While prefilling aims to establish the initial context from the input sequence, decoding focuses on continuing to generate subsequent tokens. In prefilling, tokens are visible all at once and processed in parallel to build an encoded contextual representation. In contrast, decoding operates with sequential visibility, predicting one token at a time using the previously cached key-value pairs. Consequently, prefilling is typically a compute-bound process with a high computational cost, whereas decoding is memory-bound: each step performs little arithmetic but must read the entire, ever-growing KV cache, so its overall cost becomes very high as the sequence grows.
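A rough way to see the compute-bound versus memory-bound distinction is to estimate arithmetic intensity (floating-point operations per byte of weights read) for a single weight matrix. The sketch below is a simplification under stated assumptions: one square `d_model` by `d_model` projection, 2-byte weights, a single sequence, and no accounting for activations, attention, or batching. The function name and the numbers are illustrative only.

```python
def arithmetic_intensity(tokens_per_pass: int, d_model: int,
                         bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weights read for one d_model x d_model projection."""
    flops = 2 * tokens_per_pass * d_model * d_model   # one multiply-add per weight per token
    weight_bytes = d_model * d_model * bytes_per_weight
    return flops / weight_bytes


# Prefilling: all 512 prompt tokens share a single read of the weights.
prefill_intensity = arithmetic_intensity(tokens_per_pass=512, d_model=4096)
# Decoding: each step processes one new token but still reads every weight.
decode_intensity = arithmetic_intensity(tokens_per_pass=1, d_model=4096)

print(prefill_intensity)  # 512.0
print(decode_intensity)   # 1.0
```

Under these assumptions the weights are reused across hundreds of tokens during prefilling, while a decoding step gets only one token's worth of arithmetic out of the same memory traffic.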

Comparison of Prefilling and Decoding Phases

This strategy, known as the disaggregation of prefilling and decoding, implements continuous batching by using two specialized hardware engines. A dedicated 'Engine 1' performs prefilling for a batch of requests. Once complete, the generated Key-Value (KV) cache is sent to a separate 'Engine 2' for decoding. The primary benefit of this pipeline is that Engine 1 can immediately start prefilling the next batch while Engine 2 is decoding the first. This overlapping of computations is key to improving computational efficiency and maximizing hardware utilization.
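The benefit of overlapping the two engines can be illustrated with a small timing model. This is a sketch under simplifying assumptions: every batch takes the same fixed prefilling and decoding time, KV cache transfer between engines is treated as free, and each engine handles one batch at a time. The function names and durations are hypothetical.

```python
def serial_makespan(num_batches: int, prefill_time: float, decode_time: float) -> float:
    """One engine does prefilling and then decoding for each batch in turn."""
    return num_batches * (prefill_time + decode_time)


def pipelined_makespan(num_batches: int, prefill_time: float, decode_time: float) -> float:
    """Engine 1 prefills batch i+1 while Engine 2 decodes batch i."""
    prefill_done = 0.0  # time at which Engine 1 finishes the current batch
    decode_done = 0.0   # time at which Engine 2 finishes the current batch
    for _ in range(num_batches):
        prefill_done += prefill_time
        # Engine 2 starts once the KV cache has arrived and it is idle.
        decode_done = max(prefill_done, decode_done) + decode_time
    return decode_done


print(serial_makespan(4, 1.0, 3.0))     # 16.0
print(pipelined_makespan(4, 1.0, 3.0))  # 13.0
```

In this model the gain is bounded by the shorter of the two stages, and a real deployment would also have to pay for moving the KV cache between engines.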

Disaggregation of Prefilling and Decoding using Pipelined Engines

After a large language model processes an initial prompt, it enters a generation stage where it produces the output sequence one token at a time. In each step of this stage, a new query vector is generated for the current position, and it must perform an attention operation over the key-value pairs of the initial prompt *plus* all the key-value pairs of the tokens generated in previous steps. As the output sequence gets longer, what becomes the most significant performance bottleneck for generat

A large language model has finished processing an initial prompt and is about to generate the first token of its response. Arrange the following events in the correct chronological order for this single generation step.

Evaluating an Inference Optimization Proposal

You run an internal LLM inference service for empl...

You’re on-call for an internal LLM chat service. M...

You operate a GPU-backed LLM service that uses con...

Your company’s internal LLM service handles many c...

You operate an internal LLM inference service for employees. Traffic has two dominant patterns: (1) many requests start with the same 200-token “policy + tool instructions” prefix and then diverge, and (2) a smaller number of ad‑hoc requests have long, unique prompts (2,000–6,000 tokens). The service uses continuous batching and must keep p95 latency stable.

A proposed redesign includes: (a) prefix caching that stores the KV cache state for the shared 200-token prefix so future requests can skip recomputing that portion of the prompt, and (b) PagedAttention (paged KV caching) so each sequence’s KV cache grows in fixed-size pages rather than requiring a single contiguous allocation.

Write an evaluation that explains, in one coherent argument, how this redesign changes GPU compute and memory behavior across BOTH the prefilling phase and the token-by-token decoding phase. Your answer must:
- Explain what work is avoided (and what is not avoided) when a request hits the prefix cache, and how that changes prefilling cost and time-to-first-token.
- Explain why decoding still depends on the KV cache and how KV cache growth during decoding interacts with variable output lengths.
- Analyze how memory fragmentation can arise in a standard contiguous KV allocation scheme under this workload, and how paged KV allocation changes the failure/throughput profile.
- Identify at least two concrete tradeoffs/risks introduced by combining prefix caching with paged KV caching (e.g., memory overhead, eviction behavior, page table/indirection costs, cache hit-rate sensitivity), and recommend one operational policy (e.g., what to cache/evict or when to disable caching) to keep p95 latency stable.

Assume the model is an autoregressive Transformer decoder and that the KV cache stores keys/values for all previously processed tokens (prompt + generated tokens).
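To make the prefix-cache hit in this scenario concrete, the following minimal sketch counts how many prompt tokens still need prefilling. The function `tokens_to_prefill`, the dictionary-based cache, and the token IDs are hypothetical, and the cached KV state is represented by a placeholder string rather than real tensors.

```python
from typing import Dict, List, Tuple


def tokens_to_prefill(prompt: List[int],
                      cached_prefixes: Dict[Tuple[int, ...], str]) -> int:
    """Return how many prompt tokens remain to be prefilled after the longest cache hit."""
    longest_hit = 0
    for prefix in cached_prefixes:
        n = len(prefix)
        if longest_hit < n <= len(prompt) and tuple(prompt[:n]) == prefix:
            longest_hit = n
    return len(prompt) - longest_hit


shared_prefix = list(range(200))                            # the 200-token shared prefix
chat_prompt = shared_prefix + [1000 + i for i in range(50)]  # prefix plus a 50-token question
adhoc_prompt = [5000 + i for i in range(3000)]               # long, unique prompt
cache = {tuple(shared_prefix): "kv-placeholder"}

print(tokens_to_prefill(chat_prompt, cache))   # 50
print(tokens_to_prefill(adhoc_prompt, cache))  # 3000
```

A hit removes only the prefilling work for the shared tokens. The request-specific suffix must still be prefilled, and every decoding step still attends over the prefix's keys and values, so that KV state has to stay resident in memory.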

Evaluating a serving design that combines prefix caching with paged KV memory under mixed prompt lengths

You operate an internal LLM inference service for a company knowledge assistant. Traffic has two dominant patterns: (1) many users start chats with the same 300-token “policy + safety + tool instructions” system prompt, then ask different questions; (2) a smaller set of power users submit long, unique prompts (2,000–4,000 tokens). The server uses continuous batching and must keep p95 latency low. Recently, you observe that GPU memory monitoring often shows ~25% free memory, yet new long requests intermittently fail to start or cause sharp throughput drops after the system has been running for hours.

Write an evaluation recommending a concrete inference-time caching and memory-management approach that addresses both compute and memory issues. Your answer must explain, in one coherent argument, how (a) KV cache growth differs between the initial prompt processing and token-by-token generation, (b) prefix caching changes the amount of prefilling work for shared-prefix requests and what it costs in memory, and (c) memory fragmentation can cause “free memory but allocation failure,” including how paged KV caching (PagedAttention) would change allocation behavior. Conclude with a justified recommendation (e.g., enable/disable prefix caching, use paged KV caching, and any constraints such as eviction policy or what to cache) and explicitly discuss the tradeoffs you are accepting.
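The "free memory but allocation failure" symptom can be reproduced with a toy memory pool. The sketch below is illustrative only: the pool is a list of token slots, the occupancy pattern is contrived so that 25% of the pool is free in small holes, and the function names and page size are assumptions rather than any real allocator's interface.

```python
from typing import List


def largest_contiguous_free(used: List[bool]) -> int:
    """Length of the longest run of free slots."""
    best = run = 0
    for slot_used in used:
        run = 0 if slot_used else run + 1
        best = max(best, run)
    return best


def can_alloc_contiguous(used: List[bool], n_slots: int) -> bool:
    """A contiguous allocator needs a single hole of at least n_slots."""
    return largest_contiguous_free(used) >= n_slots


def can_alloc_paged(used: List[bool], n_slots: int, page_size: int) -> bool:
    """A paged allocator only needs enough free pages, located anywhere in the pool."""
    free_pages = sum(
        1 for start in range(0, len(used), page_size)
        if not any(used[start:start + page_size])
    )
    pages_needed = -(-n_slots // page_size)  # ceiling division
    return free_pages >= pages_needed


# 80 slots; finished sequences left four 5-slot holes between live ones.
pool = ([True] * 15 + [False] * 5) * 4

print(pool.count(False) / len(pool))              # 0.25
print(can_alloc_contiguous(pool, 20))             # False
print(can_alloc_paged(pool, 20, page_size=5))     # True
```

The contiguous scheme fails because no single hole is large enough, even though the total free space would fit the request. Paging removes that failure mode at the cost of an indirection from logical positions to physical pages, plus some unused slots in each sequence's last page.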

Choosing a KV-cache strategy for shared-prefix traffic under GPU memory pressure

You operate a GPU-based LLM inference service that uses continuous batching to serve many concurrent chat sessions. Each request has (a) a prompt that must be processed before generation starts and (b) a variable-length generated response. Production telemetry shows two symptoms: (1) latency spikes occur when many new requests arrive with long prompts that share a common system prefix (e.g., the same 200-token policy header), and (2) after several hours of mixed traffic, the service sometimes fails to admit a new long request even though ~25–35% of GPU memory is reported free.

Write an engineering recommendation memo that proposes a coherent end-to-end approach to reduce both the latency spikes and the admission failures. Your memo must explicitly connect: how the KV cache is created and grows across the prompt-processing stage versus token-by-token generation; how reusing KV states for shared prompt prefixes changes the amount of prompt work performed; why the observed “free memory but cannot allocate” symptom can occur in KV-cache allocation; and how a paged/block-based KV-cache allocator would change the failure mode and memory utilization. Conclude by stating at least two concrete tradeoffs/risks (e.g., memory overhead, eviction policy complexity, access patterns) and how you would validate the improvement with metrics or experiments.

Diagnosing and Redesigning KV-Cache Memory Behavior in a Multi-Tenant LLM Serving Stack

You are on-call for an internal LLM chat-completions service used by multiple product teams. Traffic has two dominant patterns: (1) many requests share an identical 250-token system prompt (policy + formatting) but have different user messages; (2) a smaller set of power users send very long, unique prompts (2,000–6,000 tokens). The service uses continuous batching and standard contiguous KV-cache allocation per sequence.

Symptoms over a 2-hour window:
- P50 time-to-first-token (TTFT) is good, but P99 TTFT spikes when long prompts arrive.
- During spikes, GPU monitoring shows ~25–35% total memory free, yet new long requests sometimes fail to start with an out-of-memory/allocation error.
- When failures happen, short requests still decode, but overall throughput drops.

You are allowed to change only inference-time memory/caching strategy (no model changes). Propose a concrete design that addresses BOTH (a) the TTFT spikes and (b) the allocation failures, using KV-cache behavior across prefilling vs decoding, prefix caching, and a fragmentation-aware KV memory scheme. In your answer, explain the causal chain from the current design to the observed symptoms, and justify the tradeoffs your design makes (e.g., memory overhead vs compute saved, and any impact on decoding performance).

Stabilizing latency and GPU memory in a chat-completions service with shared system prompts

You are the on-call engineer for an internal LLM gateway that serves two high-volume products on the same GPU pool: (A) a customer-support chat agent and (B) a report generator. Both products use the same 220-token system prompt, but user prompts vary from 20–2,000 tokens. Typical outputs are 50 tokens for (A) and 1,500 tokens for (B). The serving stack uses continuous batching and stores each request’s KV cache in a single contiguous allocation that grows as decoding proceeds.

Over the last week, you observe two symptoms that often occur together during peak hours:
1) New long requests fail to start with an out-of-memory error even when monitoring shows ~25% of GPU memory is free.
2) P99 token latency during streaming generation increases steadily over time, especially when many long outputs are in flight.

A teammate proposes a quick fix: “Enable prefix caching for the shared system prompt; that will reduce compute and should also fix the memory issues.” Another teammate proposes: “Switch KV cache allocation to a paged/block-based scheme (PagedAttention-style) to eliminate fragmentation; prefix caching is optional.”

As the incident lead, choose which proposal you would implement first (prefix caching first vs paged KV caching first), and justify your decision by explicitly connecting: (i) what happens in prefilling vs decoding, (ii) how the KV cache grows and is reused across decoding steps, (iii) why the system can OOM despite free memory (fragmentation), and (iv) how your chosen change affects both memory behavior and end-to-end latency for these two products. Your answer should also name one concrete risk/tradeoff introduced by your chosen change.

Root-cause and mitigation plan for OOMs and latency spikes during shared-prefix, long-generation traffic

You are the on-call engineer for an internal LLM gateway that serves two workloads on the same GPU pool: (A) a chat product where every request begins with the same 600-token system prompt, and (B) an agent workflow that sends highly variable prompts (50–4000 tokens) and often streams 800–1500 generated tokens. The serving stack uses continuous batching and stores each sequence’s KV cache in GPU memory during generation.

After a traffic spike, you observe the following symptoms over a 30-minute window:
1) Median time-to-first-token (TTFT) increases sharply, but tokens/second during streaming generation degrades only mildly.
2) GPU memory monitoring shows ~25% free memory, yet new long agent requests frequently fail to start with an out-of-memory allocation error.
3) When you temporarily disable reuse of the shared 600-token system prompt (i.e., you always recompute it per request), TTFT gets worse but the OOM allocation failures become less frequent.

Assume the model is a standard autoregressive Transformer with a KV cache; inference consists of an initial prompt-processing stage that populates the KV cache followed by token-by-token generation that appends to the KV cache.

As the incident owner, propose ONE coherent serving change (a single design choice, not a list) that best explains and addresses all three symptoms at once. Your answer must (i) identify the most likely root cause linking TTFT behavior and the “free memory but OOM” paradox, and (ii) justify why your chosen change improves the situation by explicitly referencing how it affects KV-cache allocation during prompt processing vs. token-by-token generation, and how it interacts with shared-prefix reuse.

Post-incident analysis: KV-cache growth, fragmentation, and shared-prefix reuse in a streaming LLM service

In large language models, the objective of the decoding phase is to find the best predicted sequence of tokens. Instead of conditioning the prediction directly on the original input sequence, the generation process relies entirely on the contextual representation built during the preceding prefilling stage. The optimal predicted sequence, denoted as $$\hat{\mathbf{y}}$$, is determined by maximizing the conditional probability over this context: $$\hat{\mathbf{y}} = \operatorname*{arg\,max}_{\mathbf{y}} \Pr(\mathbf{y}|\mathrm{cache})$$ where $$\mathrm{cache}$$ refers to the accumulated Key-Value (KV) cache.
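Because generation is autoregressive, the sequence-level probability in this objective can be expanded with the chain rule into per-token terms. In the expansion below, $$n$$ (the output length) and $$y_i$$ (the $$i$$-th output token) are notation introduced here for illustration, and the prefilled context plays the role of $$\mathrm{cache}$$ before any token has been generated:

$$\log \Pr(\mathbf{y}|\mathrm{cache}) = \sum_{i=1}^{n} \log \Pr(y_i|\mathrm{cache}, \mathbf{y}_{<i})$$

Each term on the right corresponds to one decoding step, which is why the search for $$\hat{\mathbf{y}}$$ proceeds token by token.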

Decoding Phase Goal Formula

During each step $$i$$ of autoregressive generation, the model computes a new query ($$\mathbf{q}_i$$), key ($$\mathbf{k}_i$$), and value ($$\mathbf{v}_i$$) vector from the current input token. The new key-value pair ($$\mathbf{k}_i, \mathbf{v}_i$$) is appended to the Key-Value (KV) cache, which holds the pairs for all preceding tokens. The attention operation is then performed using the new query $$\mathbf{q}_i$$ and the complete set of keys and values stored in the cache up to the current step, denoted as $$\mathbf{K}_{\leq i}$$ and $$\mathbf{V}_{\leq i}$$. This process generates the output for step $$i$$ by allowing the current token to attend to itself and all previous tokens in the sequence.
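Assuming standard scaled dot-product attention with a single head (the scaling factor and the dimension $$d$$ of the key vectors are not spelled out above and are assumptions here), the cache update and the attention output at step $$i$$ can be written as:

$$\mathbf{K}_{\leq i} = \begin{bmatrix} \mathbf{K}_{\leq i-1} \\ \mathbf{k}_i \end{bmatrix}, \qquad \mathbf{V}_{\leq i} = \begin{bmatrix} \mathbf{V}_{\leq i-1} \\ \mathbf{v}_i \end{bmatrix}$$

$$\mathrm{Att}(\mathbf{q}_i, \mathbf{K}_{\leq i}, \mathbf{V}_{\leq i}) = \mathrm{Softmax}\left(\frac{\mathbf{q}_i \mathbf{K}_{\leq i}^{\top}}{\sqrt{d}}\right) \mathbf{V}_{\leq i}$$

Only one new row is computed per step; the rows for earlier tokens are read from the cache rather than recomputed.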
