Learn Before
Causal Attention Mask Matrix Definition
In self-attention mechanisms where queries, keys, and values are represented by matrices Q, K, and V, a masking variable is used to ensure that token prediction is based only on preceding tokens. This is achieved with a mask matrix M. The entry at row i and column k of this matrix is defined as M(i, k) = 0 if k ≤ i (allowing attention to current and past positions) and M(i, k) = -∞ if k > i (prohibiting attention to future positions). This mask is added to the attention scores before the softmax activation.
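A minimal sketch of this definition, assuming NumPy and the standard scaled dot-product form of attention (the function names and the 1/√d scaling are illustrative, not part of the card itself):

```python
import numpy as np

def causal_mask(m: int) -> np.ndarray:
    """Return the m x m mask M: 0 where k <= i, -inf where k > i."""
    mask = np.zeros((m, m))
    mask[np.triu_indices(m, k=1)] = -np.inf  # strictly above the diagonal = future
    return mask

def causal_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)            # raw (scaled) attention scores
    scores += causal_mask(Q.shape[0])        # add M: future positions become -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax; exp(-inf) = 0
    return weights @ V                       # weighted sum of value vectors
```

After the softmax, each masked entry contributes exp(-∞) = 0, so token i attends only to positions 1 through i.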

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.5 Inference - Foundations of Large Language Models
Related
Causal Attention Input Structure
Causal Attention Mask Matrix Definition
Causal Attention Weight Matrix Calculation
An engineer is implementing an attention mechanism where the output is a weighted sum of Value vectors, with weights determined by a Softmax function applied to scores. They observe that as the dimension (d) of the Query and Key vectors increases, the attention weights become extremely concentrated on a single position (e.g., [0.01, 0.98, 0.01]), causing training instability. The scores are derived from the dot product of Query (Q) and Key (K) matrices. What is the most likely cause of this issue? (A scaling sketch follows this list.)
Attention Mechanism Misapplication in Summarization
Analyzing the Role of the Mask in Attention
Selecting an Attention Design for Long-Context, Low-Latency Inference
Diagnosing and Redesigning Attention for a Long-Context, Cost-Constrained LLM Service
Choosing an Attention Stack for a Regulated, Long-Document Review Assistant
Attention Redesign for a Long-Context Customer-Support Copilot Under GPU Memory Pressure
Attention Redesign for a Multi-Tenant LLM with Long Context and Strict KV-Cache Budgets
Attention Architecture Choice for On-Device Meeting Summarization with 60k Context
You’re debugging an LLM inference service that mus...
You’re reviewing a design doc for a Transformer at...
Your team is deploying a chat-based LLM that must ...
You’re leading an LLM platform team that must supp...
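For the engineer's question above: with zero-mean, unit-variance components, the dot product q·k of d-dimensional vectors has variance d, so unscaled scores grow with dimension and the softmax saturates onto one position; dividing by √d holds the variance near 1. A minimal sketch (NumPy and the random values assumed for illustration):

```python
import numpy as np

def softmax(s: np.ndarray) -> np.ndarray:
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
for d in (16, 256, 4096):
    q = rng.standard_normal(d)
    k = rng.standard_normal((3, d))                # three key vectors
    raw = k @ q                                    # unscaled scores: std ~ sqrt(d)
    print(d, np.round(softmax(raw), 3),            # concentrates as d grows
          np.round(softmax(raw / np.sqrt(d)), 3))  # scaled weights stay diffuse
```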
Learn After
In a self-attention mechanism designed for autoregressive tasks, a sequence of 5 tokens is processed. The mechanism computes raw attention scores for each token relative to all other tokens. Before a final normalization step, a mask is added to these scores to prevent any token from attending to future tokens. For the 3rd token in the sequence, which vector correctly represents its scores for all 5 tokens after this causal mask has been applied? (Let s_i denote the original raw score for the 3rd token attending to the i-th token.)
Rationale for Causal Mask Values
In a self-attention mechanism processing a sequence of 4 tokens, a mask is added to the raw attention scores to prevent any token from attending to subsequent (future) tokens. Which of the following 4x4 matrices correctly represents this mask?
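As a concrete check on the two questions above, a short sketch (NumPy assumed; the raw scores are hypothetical) prints the 4x4 causal mask and the masked score vector for the 3rd token of a 5-token sequence:

```python
import numpy as np

def causal_mask(m: int) -> np.ndarray:
    mask = np.zeros((m, m))
    mask[np.triu_indices(m, k=1)] = -np.inf
    return mask

print(causal_mask(4))                        # 0 on/below the diagonal, -inf above

s = np.array([0.7, -0.2, 1.1, 0.4, 0.3])     # hypothetical raw scores s_1..s_5
print(s + causal_mask(5)[2])                 # row i=3: [s_1, s_2, s_3, -inf, -inf]
```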