Learn Before
Self-Attention Formula for the Prefilling Phase
During the prefilling phase, self-attention is computed for the entire input sequence in a single operation. The query, key, and value vectors are represented as the matrices Q, K, and V. The attention output is calculated using the scaled dot-product formula:

Attention(Q, K, V) = Softmax((QK^T / sqrt(d)) + Mask)V

Here, the causal mask, Mask, prevents tokens from attending to future positions by setting the corresponding entries in the attention score matrix to a large negative number (e.g., -∞) before the Softmax function is applied.
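As a concrete illustration, here is a minimal NumPy sketch of this computation. The function name prefill_attention, the toy dimensions, and the use of -inf for the masked entries are assumptions for illustration, not details taken from the card.

```python
import numpy as np

def prefill_attention(Q, K, V):
    """Masked scaled dot-product attention over a full input sequence.

    Q, K, V: (n, d) matrices packing the query/key/value vectors of all
    n prompt tokens, so attention is computed in a single operation.
    (Sketch only; names and dimensions are illustrative assumptions.)
    """
    n, d = Q.shape
    # QK^T / sqrt(d): interaction scores for every pair of tokens,
    # computed at once because the whole prompt is available.
    scores = Q @ K.T / np.sqrt(d)
    # Causal mask: -inf above the diagonal, so after Softmax no token
    # attends to a future position.
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    scores = scores + mask
    # Row-wise Softmax; the -inf entries become zero attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Example with arbitrary toy sizes: 4 prompt tokens, head dimension 8.
rng = np.random.default_rng(0)
n, d = 4, 8
out = prefill_attention(rng.normal(size=(n, d)),
                        rng.normal(size=(n, d)),
                        rng.normal(size=(n, d)))
print(out.shape)  # (4, 8): one output vector per prompt token
```

Note that the single Q @ K.T matrix multiplication is what produces all n×n pairwise scores simultaneously, which is why prefilling can process the entire prompt in parallel rather than token by token.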

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Self-Attention Formula for the Prefilling Phase
Prefilling as a Compute-Bound Process
Token Prediction within the Prefilling Phase
When a large language model first processes a user's prompt, it can perform calculations for all words in the prompt simultaneously rather than one by one. What is the fundamental condition that makes this highly parallel approach possible during this initial stage?
LLM Inference Performance Analysis
Rationale for Parallelism in Initial Prompt Processing
Diagram of the Prefilling Phase
Learn After
The scaled dot-product attention formula, Softmax((QK^T / sqrt(d)) + Mask)V, is used when an entire input sequence is available for simultaneous processing. Which specific operation within this formula directly represents the parallel computation of interaction scores between every possible pair of tokens in the sequence, a step that is only feasible because the entire input is present at once?
Optimizing Prefilling Phase Performance
Consequences of Removing the Causal Mask