What fundamental characteristic of the initial prompt processing stage allows for this high level of computational efficiency, and why does this characteristic not apply to the word-by-word generation phase?


A key characteristic of the prefilling phase is that the entire input sequence is known in advance, so all of its tokens can be processed simultaneously. This allows a highly parallelized self-attention computation in which all query vectors are stacked into a single matrix, $\mathbf{Q}$. The approach makes efficient use of the parallel computing capabilities of modern GPUs, which significantly speeds up prefilling. During generation, by contrast, each new token depends on the token produced just before it, so the tokens must be computed one at a time.

Parallel Self-Attention in the Prefilling Phase

During the prefilling phase, self-attention is computed for the entire input sequence in a single operation. The query, key, and value vectors are stacked row by row into matrices $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{(m+1) \times d}$, one row per token. The attention output is calculated using the scaled dot-product formula: $$\text{Att}_{\text{qkv}}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\text{T}}}{\sqrt{d}} + \text{Mask}\right)\mathbf{V}$$ Here, the causal mask, $\text{Mask} \in \mathbb{R}^{(m+1) \times (m+1)}$, prevents tokens from attending to future positions. Its entries are $0$ for allowed positions and $-\infty$ (in practice, a very large negative number) for future positions, so adding it to the score matrix drives the corresponding attention weights to zero once the Softmax function is applied.
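The formula can be sketched in a few lines of NumPy. This is a minimal illustration, not a production kernel: it assumes a single attention head, random inputs, and that each of `Q`, `K`, `V` holds one row per token (shape `(n, d)` with `n = m + 1`). The function name `causal_attention` is hypothetical.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.

    Assumption: Q, K, V have shape (n, d), one row per token.
    """
    n, d = Q.shape
    # All n queries are scored against all n keys in one matrix product.
    scores = Q @ K.T / np.sqrt(d)                    # shape (n, n)
    # Mask: 0 on and below the diagonal, -inf strictly above it.
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    scores = scores + mask
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 5, 8                                          # illustrative sizes only
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out, weights = causal_attention(Q, K, V)
```

Every row of `weights` sums to 1, and every entry above the diagonal is exactly 0, so token `i` never draws on tokens after position `i`. The first token can only attend to itself, so `out[0]` equals `V[0]`.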

Self-Attention Formula for the Prefilling Phase

The prefilling phase is generally considered a compute-bound process. Because self-attention for the entire sequence is computed in parallel, many small operations are merged into a single large one, and each value loaded from memory is reused across many tokens. This reduces data transfers between memory and the processing unit (such as a GPU) relative to the amount of arithmetic performed, so the primary performance limit becomes the raw computational power of the hardware rather than the speed at which data can be moved (memory bandwidth).
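A rough way to see this is to estimate arithmetic intensity (floating-point operations per byte moved) for a single weight matrix applied to a batch of tokens. The sketch below uses a deliberately simplified cost model that is an assumption of this example, not a measurement: a square `d_model × d_model` weight matrix, 2-byte values, weights read once, and activations read and written once. The function name and the sizes are illustrative.

```python
def arithmetic_intensity(n_tokens, d_model, bytes_per_value=2):
    """FLOPs per byte for Y = X @ W under a simplified cost model.

    Assumptions: X is (n_tokens, d_model), W is (d_model, d_model),
    W is read from memory once, X is read once and Y is written once.
    """
    flops = 2 * n_tokens * d_model * d_model            # multiply + add
    weight_bytes = d_model * d_model * bytes_per_value
    activation_bytes = 2 * n_tokens * d_model * bytes_per_value
    return flops / (weight_bytes + activation_bytes)

# Prefilling: many prompt tokens share one pass over the weights.
prefill = arithmetic_intensity(n_tokens=1024, d_model=4096)
# Token-by-token generation: the same weights are read for a single token.
decode = arithmetic_intensity(n_tokens=1, d_model=4096)
print(f"prefill: {prefill:.1f} FLOPs/byte, decode: {decode:.2f} FLOPs/byte")
```

Under these assumptions the single-token case performs about one operation per byte moved, while the 1024-token case performs several hundred, which is the sense in which prefilling shifts the bottleneck from memory bandwidth to compute.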

Prefilling as a Compute-Bound Process

The prefilling phase involves a parallel computation where the entire input sequence is processed at once to generate the KV cache. A key outcome of this process is the determination of the probability distribution for the first output token. Furthermore, in certain scenarios, this phase can extend to predict subsequent tokens, such as the second output token.
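The relationship between prefilling and the KV cache can be illustrated with a toy single-head layer. Everything named here (`prefill`, `decode_step`, the random projection matrices `Wq`, `Wk`, `Wv`) is a hypothetical stand-in, assumed only for this sketch. A real model would pass the last position's output through its remaining layers and an output head to obtain the probability distribution for the first output token.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def prefill(X, Wq, Wk, Wv):
    """Process all prompt tokens at once and return outputs plus the KV cache."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # every token in one step
    n, d = Q.shape
    mask = np.triu(np.full((n, n), -np.inf), k=1)  # causal mask
    out = softmax(Q @ K.T / np.sqrt(d) + mask) @ V
    return out, (K, V)

def decode_step(x_new, cache, Wq, Wk, Wv):
    """Process one new token, reusing the cached keys and values."""
    K, V = cache
    q = x_new @ Wq                                 # a single query row
    K = np.vstack([K, x_new @ Wk])                 # append the new key
    V = np.vstack([V, x_new @ Wv])                 # append the new value
    out = softmax(q @ K.T / np.sqrt(K.shape[1])) @ V
    return out, (K, V)

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
prompt = rng.standard_normal((6, d))               # embeddings of 6 prompt tokens
next_tok = rng.standard_normal((1, d))             # embedding of one later token

_, cache = prefill(prompt, Wq, Wk, Wv)
step_out, cache = decode_step(next_tok, cache, Wq, Wk, Wv)

# Reference: recompute attention over all 7 tokens from scratch.
full_out, _ = prefill(np.vstack([prompt, next_tok]), Wq, Wk, Wv)
```

The single decode step reproduces the last row of the full recomputation (`full_out[-1]`), which shows that the cache built during prefilling carries everything later steps need from the prompt. The decode step cannot be batched with its successors, though, because in real generation `next_tok` is only known after the previous step has produced it.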

Token Prediction within the Prefilling Phase

When a large language model first processes a user's prompt, it can perform calculations for all words in the prompt simultaneously rather than one by one. What is the fundamental condition that makes this highly parallel approach possible during this initial stage?

LLM Inference Performance Analysis

A key computational advantage during the initial processing of a prompt is the ability to perform calculations for all input tokens simultaneously. Explain the fundamental reason why this high degree of parallelism is possible at this stage. In your explanation, contrast this with a situation where tokens must be processed one at a time.

Rationale for Parallelism in Initial Prompt Processing

This diagram illustrates the data flow during the prefilling stage of a Transformer. The entire input sequence, represented as tokens `x0` through `xm-1`, is initially converted into vectors by an Embedding Layer. Following this, a self-attention layer processes all these vectors simultaneously. In this parallel operation, the layer generates a complete set of query vectors (`q0` to `qm-1`), key vectors (`k0` to `km-1`), and value vectors (`v0` to `vm-1`) for the entire input sequence in a single step. This 'processed all at once' approach is the defining characteristic of the prefilling phase.
