During the prefilling phase, self-attention is computed for the entire input sequence in a single operation. The query, key, and value vectors are stacked row-wise into matrices $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{(m+1) \times d}$, one row per token. The attention output is calculated using the scaled dot-product formula: $$\text{Att}_{\text{qkv}}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Softmax}\left(\frac{\mathbf{QK}^{\text{T}}}{\sqrt{d}} + \text{Mask}\right)\mathbf{V}$$ Here, the causal mask, $\text{Mask} \in \mathbb{R}^{(m+1) \times (m+1)}$, prevents tokens from attending to future positions: its entries are $0$ for allowed positions and $-\infty$ (in practice, a very large negative number) for future positions, so the corresponding attention weights become zero after the Softmax function is applied.
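The formula above can be sketched in plain Python. This is a minimal illustration rather than an optimized implementation: it assumes one row per token (so each matrix has $m+1$ rows of length $d$), and the function name `causal_attention` is chosen here for illustration only.

```python
import math

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.

    Q, K, V are lists of n row vectors of length d (one row per token).
    """
    n, d = len(Q), len(Q[0])
    out = []
    for i in range(n):
        # Score query i against every key; future positions (j > i) are masked with -inf.
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d) if j <= i else -math.inf
            for j in range(n)
        ]
        # Numerically stable softmax; math.exp(-inf) evaluates to 0.0.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Output row i is the weighted sum of the value rows.
        out.append([sum(w * V[j][c] for j, w in enumerate(weights)) for c in range(len(V[0]))])
    return out
```

Because position 0 can only attend to itself, the first output row equals the first value row, and changing later rows of `K` or `V` leaves earlier output rows unchanged.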

Self-Attention Formula for the Prefilling Phase

The prefilling phase is generally considered compute-bound. Because self-attention for the entire sequence is computed in parallel, many small operations are merged into a single large one. This reduces the number of data transfers between memory and the processing unit (such as a GPU), so the primary performance limit becomes the raw computational throughput of the hardware rather than the speed at which data can be moved (memory bandwidth).
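A rough way to see this is to compare the arithmetic intensity (FLOPs per byte moved) of a single linear projection when it is applied to many tokens at once versus one token at a time. The sketch below is a simplified back-of-the-envelope model: the function name, the 2-byte value size, and the choice of a square `d_model x d_model` weight matrix are assumptions made for illustration, and on-chip caching, attention itself, and KV cache traffic are ignored.

```python
def arithmetic_intensity(num_tokens, d_model, bytes_per_value=2):
    """FLOPs per byte moved for one d_model x d_model linear projection."""
    # One multiply and one add per weight per token.
    flops = 2 * num_tokens * d_model * d_model
    # Weights are read once, no matter how many tokens share them.
    weight_bytes = bytes_per_value * d_model * d_model
    # Input activations are read and output activations are written.
    activation_bytes = bytes_per_value * 2 * num_tokens * d_model
    return flops / (weight_bytes + activation_bytes)

# Hypothetical sizes: a 1000-token prompt versus a single token per step.
prefill = arithmetic_intensity(num_tokens=1000, d_model=4096)
decode = arithmetic_intensity(num_tokens=1, d_model=4096)
```

Under these assumptions the many-token case performs several hundred FLOPs per byte, while the single-token case performs about one, which is why the former tends to be limited by compute and the latter by memory bandwidth.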

Prefilling as a Compute-Bound Process

The prefilling phase involves a parallel computation where the entire input sequence is processed at once to generate the KV cache. A key outcome of this process is the determination of the probability distribution for the first output token. Furthermore, in certain scenarios, this phase can extend to predict subsequent tokens, such as the second output token.

Token Prediction within the Prefilling Phase

When a large language model first processes a user's prompt, it can perform calculations for all words in the prompt simultaneously rather than one by one. What is the fundamental condition that makes this highly parallel approach possible during this initial stage?

What fundamental characteristic of the initial prompt processing stage allows for this high level of computational efficiency, and why does this characteristic not apply to the word-by-word generation phase?

LLM Inference Performance Analysis

A key computational advantage during the initial processing of a prompt is the ability to perform calculations for all input tokens simultaneously. Explain the fundamental reason why this high degree of parallelism is possible at this stage. In your explanation, contrast this with a situation where tokens must be processed one at a time.

Rationale for Parallelism in Initial Prompt Processing

This diagram illustrates the data flow during the prefilling stage of a Transformer. The entire input sequence, represented as tokens `x0` through `xm-1`, is initially converted into vectors by an Embedding Layer. Following this, a self-attention layer processes all these vectors simultaneously. In this parallel operation, the layer generates a complete set of query vectors (`q0` to `qm-1`), key vectors (`k0` to `km-1`), and value vectors (`v0` to `vm-1`) for the entire input sequence in a single step. This 'processed all at once' approach is the defining characteristic of the prefilling phase.

Diagram of the Prefilling Phase

A key characteristic of the prefilling phase is its ability to process the entire input sequence simultaneously. This allows for a highly parallelized self-attention computation where all query vectors are grouped into a single matrix, $\mathbf{Q}$. This approach makes efficient use of the parallel computing capabilities of modern GPUs, which significantly speeds up the prefilling process.


The prefilling phase is the initial stage of Transformer inference where the model processes the input sequence, denoted as `x`, to compute and populate the Key-Value (KV) cache. This stage is named 'prefilling' because its primary function is to prepare and store the key-value vector pairs for every token in the input prompt before the generative decoding process begins.

Prefilling Phase in Transformer Inference

Reference of Foundations of Large Language Models Course

The prefilling of the Key-Value (KV) cache, a preparatory step for autoregressive inference, is represented by the formula: $$\text{cache} = \text{Dec}_{\text{kv}}(\mathbf{x})$$ In this equation, $\text{Dec}_{\text{kv}}(\cdot)$ represents the LLM's decoding network, which is architecturally identical to the standard decoding network, $\text{Dec}(\cdot)$. The key distinction is that $\text{Dec}_{\text{kv}}(\cdot)$ is configured to output the KV cache from its self-attention layers, rather than the final token representations, effectively storing the key-value pairs for the entire input sequence, $\mathbf{x}$.
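To make the idea of a network that returns the KV cache instead of token representations concrete, here is a toy sketch. It is not the actual $\text{Dec}_{\text{kv}}(\cdot)$ of any model: the function `dec_kv`, the layer dictionaries with keys `"Wq"`, `"Wk"`, `"Wv"`, and the single-head, residual-only layer (no feed-forward block or normalization) are all simplifying assumptions.

```python
import math

def matmul(X, W):
    """Multiply an (n x d) matrix by a (d x d) matrix, both given as nested lists."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)] for row in X]

def causal_attention(Q, K, V):
    n, d = len(Q), len(Q[0])
    out = []
    for i in range(n):
        # Token i only looks at positions 0..i, which is what the causal mask enforces.
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d) for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * V[j][c] for j, w in enumerate(weights)) for c in range(d)])
    return out

def dec_kv(H, layers):
    """Toy stand-in for Dec_kv: process the whole prompt once and return the KV cache.

    H holds one embedding row per prompt token; `layers` is a list of dicts with
    hypothetical projection matrices "Wq", "Wk", "Wv".
    """
    cache = []
    for layer in layers:
        Q, K, V = (matmul(H, layer[name]) for name in ("Wq", "Wk", "Wv"))
        # Keep this layer's keys and values; this is the output of interest.
        cache.append({"K": K, "V": V})
        # Simplified layer update: attention output plus residual connection.
        A = causal_attention(Q, K, V)
        H = [[h + a for h, a in zip(h_row, a_row)] for h_row, a_row in zip(H, A)]
    return cache
```

The returned list has one entry per layer, each holding one key row and one value row per prompt token, which is the state the decoding phase would then extend token by token.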

Formula for KV Cache Prefilling

Prefix caching is an advanced caching technique that extends simpler methods by storing not just full sequences but also common prefixes and their associated hidden states. This is accomplished by processing an input sequence as in the standard prefilling phase to generate and save the Key-Value (KV) cache states for each prefix. When a new request shares a prefix with a previously processed sequence, the system reuses these cached states and avoids redundant computation.
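The lookup side of this technique can be sketched with a dictionary keyed by token-ID prefixes. This is a deliberately naive illustration: the class `PrefixCache`, its methods, and the string placeholders standing in for KV states are hypothetical, and storing every prefix separately costs far more memory than a real system would accept (a shared tree or block structure would be a more likely choice).

```python
class PrefixCache:
    """Maps token-ID prefixes to previously computed KV states (toy version)."""

    def __init__(self):
        self._store = {}

    def put(self, tokens, kv_states):
        """Save the KV states of every prefix; kv_states[i] belongs to tokens[i]."""
        for end in range(1, len(tokens) + 1):
            self._store[tuple(tokens[:end])] = kv_states[:end]

    def longest_prefix(self, tokens):
        """Return (matched_length, cached_kv) for the longest cached prefix."""
        for end in range(len(tokens), 0, -1):
            hit = self._store.get(tuple(tokens[:end]))
            if hit is not None:
                return end, hit
        return 0, []

# Hypothetical token IDs; the strings stand in for per-token KV states.
cache = PrefixCache()
cache.put([101, 7, 8, 9], ["kv0", "kv1", "kv2", "kv3"])
matched, reused = cache.longest_prefix([101, 7, 8, 42, 43])
# Only the tokens from position `matched` onward still need prefilling.
```

Here the new request shares its first three tokens with the stored sequence, so three KV states are reused and only the remaining two tokens need to be prefilled.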

Prefix Caching for LLM Inference

The prefilling phase can be conceptualized as an encoding process, even though its underlying mechanism is based on token prediction. The primary objective during this phase is not to generate output tokens, but rather to construct a contextual representation of the input sequence in the form of the Key-Value (KV) cache. This cache is then used to condition the subsequent token generation in the decoding phase.

Prefilling as an Encoding Process

This strategy, known as the disaggregation of prefilling and decoding, implements continuous batching by using two specialized hardware engines. A dedicated 'Engine 1' performs prefilling for a batch of requests. Once complete, the generated Key-Value (KV) cache is sent to a separate 'Engine 2' for decoding. The primary benefit of this pipeline is that Engine 1 can immediately start prefilling the next batch while Engine 2 is decoding the first. This overlapping of computations is key to improving computational efficiency and maximizing hardware utilization.
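The benefit of overlapping the two engines can be illustrated with a small timing model. The sketch assumes fixed per-batch prefill and decode times, no cost for transferring the KV cache between engines, and unlimited buffering between them; the function names and the example numbers are illustrative and not measurements.

```python
def sequential_time(num_batches, prefill_time, decode_time):
    """One engine does prefilling and then decoding for each batch in turn."""
    return num_batches * (prefill_time + decode_time)

def pipelined_time(num_batches, prefill_time, decode_time):
    """Engine 1 prefills batch i+1 while Engine 2 decodes batch i."""
    prefill_done = 0.0
    decode_done = 0.0
    for _ in range(num_batches):
        # Engine 1 starts the next prefill as soon as the previous one ends.
        prefill_done += prefill_time
        # Engine 2 needs both its previous decode finished and the new KV cache ready.
        decode_done = max(decode_done, prefill_done) + decode_time
    return decode_done

# Hypothetical workload: 4 batches, 1.0 time unit of prefill, 3.0 of decode each.
baseline = sequential_time(4, 1.0, 3.0)   # 16.0
overlapped = pipelined_time(4, 1.0, 3.0)  # 13.0
```

In this model the pipelined total is bounded by the slower stage, so the gain is largest when prefill and decode times per batch are similar.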

Disaggregation of Prefilling and Decoding using Pipelined Engines

Standard prefilling is the conventional method for populating the Key-Value (KV) cache, where the entire input sequence is processed in a single, comprehensive forward pass. This 'prefill in one go' approach constructs the complete KV cache at once before any decoding begins.

Prefilling in One Go (Standard Prefilling)

A large language model is given a 1000-token document to process before it begins generating a new, multi-token response. Which statement best analyzes the fundamental computational difference between how the model processes the initial 1000-token document versus how it will subsequently generate each new token for its response?

Based on your understanding of how a model processes input sequences before generating new tokens, analyze the following two scenarios. Which application will dedicate a significantly larger proportion of its total computation time to the initial processing of the input prompt? Justify your answer by describing the characteristics of this initial processing phase.

Parallel Self-Attention in the Prefilling Phase

When a Transformer model begins an inference task with a given input prompt, it first performs a 'prefilling' phase. In your own words, explain the primary objective of this phase and identify its main computational output.

The Role and Output of the Prefilling Phase

You run an internal LLM inference service for empl...

You’re on-call for an internal LLM chat service. M...

You operate a GPU-backed LLM service that uses con...

Your company’s internal LLM service handles many c...

You operate an internal LLM inference service for employees. Traffic has two dominant patterns: (1) many requests start with the same 200-token “policy + tool instructions” prefix and then diverge, and (2) a smaller number of ad‑hoc requests have long, unique prompts (2,000–6,000 tokens). The service uses continuous batching and must keep p95 latency stable.

A proposed redesign includes: (a) prefix caching that stores the KV cache state for the shared 200-token prefix so future requests can skip recomputing that portion of the prompt, and (b) PagedAttention (paged KV caching) so each sequence’s KV cache grows in fixed-size pages rather than requiring a single contiguous allocation.

Write an evaluation that explains, in one coherent argument, how this redesign changes GPU compute and memory behavior across BOTH the prefilling phase and the token-by-token decoding phase. Your answer must:
- Explain what work is avoided (and what is not avoided) when a request hits the prefix cache, and how that changes prefilling cost and time-to-first-token.
- Explain why decoding still depends on the KV cache and how KV cache growth during decoding interacts with variable output lengths.
- Analyze how memory fragmentation can arise in a standard contiguous KV allocation scheme under this workload, and how paged KV allocation changes the failure/throughput profile.
- Identify at least two concrete tradeoffs/risks introduced by combining prefix caching with paged KV caching (e.g., memory overhead, eviction behavior, page table/indirection costs, cache hit-rate sensitivity), and recommend one operational policy (e.g., what to cache/evict or when to disable caching) to keep p95 latency stable.

Assume the model is an autoregressive Transformer decoder and that the KV cache stores keys/values for all previously processed tokens (prompt + generated tokens).

Evaluating a serving design that combines prefix caching with paged KV memory under mixed prompt lengths

You operate an internal LLM inference service for a company knowledge assistant. Traffic has two dominant patterns: (1) many users start chats with the same 300-token “policy + safety + tool instructions” system prompt, then ask different questions; (2) a smaller set of power users submit long, unique prompts (2,000–4,000 tokens). The server uses continuous batching and must keep p95 latency low. Recently, you observe that GPU memory monitoring often shows ~25% free memory, yet new long requests intermittently fail to start or cause sharp throughput drops after the system has been running for hours.

Write an evaluation recommending a concrete inference-time caching and memory-management approach that addresses both compute and memory issues. Your answer must explain, in one coherent argument, how (a) KV cache growth differs between the initial prompt processing and token-by-token generation, (b) prefix caching changes the amount of prefilling work for shared-prefix requests and what it costs in memory, and (c) memory fragmentation can cause “free memory but allocation failure,” including how paged KV caching (PagedAttention) would change allocation behavior. Conclude with a justified recommendation (e.g., enable/disable prefix caching, use paged KV caching, and any constraints such as eviction policy or what to cache) and explicitly discuss the tradeoffs you are accepting.

Choosing a KV-cache strategy for shared-prefix traffic under GPU memory pressure

You operate a GPU-based LLM inference service that uses continuous batching to serve many concurrent chat sessions. Each request has (a) a prompt that must be processed before generation starts and (b) a variable-length generated response. Production telemetry shows two symptoms: (1) latency spikes occur when many new requests arrive with long prompts that share a common system prefix (e.g., the same 200-token policy header), and (2) after several hours of mixed traffic, the service sometimes fails to admit a new long request even though ~25–35% of GPU memory is reported free.

Write an engineering recommendation memo that proposes a coherent end-to-end approach to reduce both the latency spikes and the admission failures. Your memo must explicitly connect: how the KV cache is created and grows across the prompt-processing stage versus token-by-token generation; how reusing KV states for shared prompt prefixes changes the amount of prompt work performed; why the observed “free memory but cannot allocate” symptom can occur in KV-cache allocation; and how a paged/block-based KV-cache allocator would change the failure mode and memory utilization. Conclude by stating at least two concrete tradeoffs/risks (e.g., memory overhead, eviction policy complexity, access patterns) and how you would validate the improvement with metrics or experiments.

Diagnosing and Redesigning KV-Cache Memory Behavior in a Multi-Tenant LLM Serving Stack

You are on-call for an internal LLM chat-completions service used by multiple product teams. Traffic has two dominant patterns: (1) many requests share an identical 250-token system prompt (policy + formatting) but have different user messages; (2) a smaller set of power users send very long, unique prompts (2,000–6,000 tokens). The service uses continuous batching and standard contiguous KV-cache allocation per sequence.

Symptoms over a 2-hour window:
- P50 time-to-first-token (TTFT) is good, but P99 TTFT spikes when long prompts arrive.
- During spikes, GPU monitoring shows ~25–35% total memory free, yet new long requests sometimes fail to start with an out-of-memory/allocation error.
- When failures happen, short requests still decode, but overall throughput drops.

You are allowed to change only inference-time memory/caching strategy (no model changes). Propose a concrete design that addresses BOTH (a) the TTFT spikes and (b) the allocation failures, using KV-cache behavior across prefilling vs decoding, prefix caching, and a fragmentation-aware KV memory scheme. In your answer, explain the causal chain from the current design to the observed symptoms, and justify the tradeoffs your design makes (e.g., memory overhead vs compute saved, and any impact on decoding performance).

Stabilizing latency and GPU memory in a chat-completions service with shared system prompts

You are the on-call engineer for an internal LLM gateway that serves two high-volume products on the same GPU pool: (A) a customer-support chat agent and (B) a report generator. Both products use the same 220-token system prompt, but user prompts vary from 20–2,000 tokens. Typical outputs are 50 tokens for (A) and 1,500 tokens for (B). The serving stack uses continuous batching and stores each request’s KV cache in a single contiguous allocation that grows as decoding proceeds.

Over the last week, you observe two symptoms that often occur together during peak hours:
1) New long requests fail to start with an out-of-memory error even when monitoring shows ~25% of GPU memory is free.
2) P99 token latency during streaming generation increases steadily over time, especially when many long outputs are in flight.

A teammate proposes a quick fix: “Enable prefix caching for the shared system prompt; that will reduce compute and should also fix the memory issues.” Another teammate proposes: “Switch KV cache allocation to a paged/block-based scheme (PagedAttention-style) to eliminate fragmentation; prefix caching is optional.”

As the incident lead, choose which proposal you would implement first (prefix caching first vs paged KV caching first), and justify your decision by explicitly connecting: (i) what happens in prefilling vs decoding, (ii) how the KV cache grows and is reused across decoding steps, (iii) why the system can OOM despite free memory (fragmentation), and (iv) how your chosen change affects both memory behavior and end-to-end latency for these two products. Your answer should also name one concrete risk/tradeoff introduced by your chosen change.

Root-cause and mitigation plan for OOMs and latency spikes during shared-prefix, long-generation traffic

You are the on-call engineer for an internal LLM gateway that serves two workloads on the same GPU pool: (A) a chat product where every request begins with the same 600-token system prompt, and (B) an agent workflow that sends highly variable prompts (50–4000 tokens) and often streams 800–1500 generated tokens. The serving stack uses continuous batching and stores each sequence’s KV cache in GPU memory during generation.

After a traffic spike, you observe the following symptoms over a 30-minute window:
1) Median time-to-first-token (TTFT) increases sharply, but tokens/second during streaming generation degrades only mildly.
2) GPU memory monitoring shows ~25% free memory, yet new long agent requests frequently fail to start with an out-of-memory allocation error.
3) When you temporarily disable reuse of the shared 600-token system prompt (i.e., you always recompute it per request), TTFT gets worse but the OOM allocation failures become less frequent.

Assume the model is a standard autoregressive Transformer with a KV cache; inference consists of an initial prompt-processing stage that populates the KV cache followed by token-by-token generation that appends to the KV cache.

As the incident owner, propose ONE coherent serving change (a single design choice, not a list) that best explains and addresses all three symptoms at once. Your answer must (i) identify the most likely root cause linking TTFT behavior and the “free memory but OOM” paradox, and (ii) justify why your chosen change improves the situation by explicitly referencing how it affects KV-cache allocation during prompt processing vs. token-by-token generation, and how it interacts with shared-prefix reuse.

Post-incident analysis: KV-cache growth, fragmentation, and shared-prefix reuse in a streaming LLM service

The decoding network responsible for generating the Key-Value (KV) cache, denoted as $\mathrm{Dec}_{\mathrm{kv}}(\cdot)$, shares the same underlying architecture as the standard decoding network used for token prediction. The primary distinction lies in its output: instead of returning the standard output representations for tokens, this specialized network returns the multi-layer KV cache produced by its self-attention layers during processing.
