Based on your understanding of the self-attention computation used when an entire input sequence is available at once, analyze the following scenario and explain the fundamental flaw in the engineer's reasoning.

Google

During the prefilling phase, self-attention is computed for the entire input sequence in a single operation. The query, key, and value vectors are represented as matrices $$\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{d \times (m+1)}$$. The attention output is calculated using the scaled dot-product formula: $$\text{Att}_{\text{qkv}}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Softmax}\left(\frac{\mathbf{QK}^{\text{T}}}{\sqrt{d}} + \text{Mask}\right)\mathbf{V}$$ Here, the causal mask, $$\text{Mask} \in \mathbb{R}^{(m+1) \times (m+1)}$$, prevents tokens from attending to future positions by setting the corresponding entries in the attention score matrix to a large negative number (e.g., $$-\infty$$) before the Softmax function is applied.

Self-Attention Formula for the Prefilling Phase

The scaled dot-product attention formula, `Softmax((QK^T / sqrt(d)) + Mask)V`, is used when an entire input sequence is available for simultaneous processing. Which specific operation within this formula directly represents the parallel computation of interaction scores between every possible pair of tokens in the sequence, a step that is only feasible because the entire input is present at once?

Optimizing Prefilling Phase Performance

In the context of the self-attention formula used during the prefilling phase, `Att(Q, K, V) = Softmax((QK^T / sqrt(d)) + Mask)V`, what would be the direct consequence on the model's information flow if the `Mask` term were omitted (i.e., treated as a matrix of all zeros)? Explain why this outcome is fundamentally incompatible with the goal of training an auto-regressive language model.

Learn Before

Related