Learn Before
Applying a Causal Mask to Attention Scores
A language model is processing a sequence of 4 tokens. After calculating the scaled dot-product scores between queries and keys, it produces the following 4x4 matrix of pre-softmax attention scores:
[[2.1, 1.5, 0.8, 1.2],
 [1.8, 2.5, 1.1, 0.9],
 [0.7, 1.3, 2.2, 1.9],
 [1.4, 0.6, 1.7, 2.8]]
To ensure that each token attends only to itself and the tokens before it, a causal mask is added to this matrix before the final softmax normalization step. Based on this process, what will be the value of the element at row index 1, column index 2 (the value 1.1) after the causal mask is applied, and why?
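The step described above can be reproduced in a few lines of NumPy. This is a minimal sketch (not part of the original question), assuming the common additive-mask convention in which negative infinity is added to every position above the diagonal:

```python
import numpy as np

# Pre-softmax attention scores from the question.
scores = np.array([
    [2.1, 1.5, 0.8, 1.2],
    [1.8, 2.5, 1.1, 0.9],
    [0.7, 1.3, 2.2, 1.9],
    [1.4, 0.6, 1.7, 2.8],
])

# Causal mask: 0 on and below the diagonal, -inf strictly above it,
# so each token can attend only to itself and earlier positions.
mask = np.triu(np.full_like(scores, -np.inf), k=1)

masked = scores + mask  # element [1, 2] becomes 1.1 + (-inf) = -inf

# Row-wise softmax; subtracting the row max keeps exp() numerically stable.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(masked[1, 2])   # -inf: the masked pre-softmax score
print(weights[1, 2])  # 0.0: the attention weight after softmax
```

The element at row index 1, column index 2 sits above the diagonal (token 1 attending to the future token 2), so the mask adds negative infinity to it: 1.1 + (-inf) = -inf. The subsequent softmax maps it to an attention weight of exactly 0, meaning token 1 places no weight on token 2.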
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Role of Causal Attention in Autoregressive Language Models
Causal Attention Output for a Single Token
Visualization of Query-Key Dot Products in Causal Attention
An autoregressive model calculates a square attention weight matrix using the formula:
Softmax((QK^T / sqrt(d)) + Mask). The purpose of the Mask component is to prevent any token from attending to subsequent tokens in the sequence. Which statement best describes the resulting attention weight matrix?

An autoregressive model is processing a sequence of 4 tokens. To ensure that the prediction for any given token is based only on the tokens that came before it and the token itself, a specific structure is imposed on the attention weight matrix. Which of the following 4x4 matrices correctly illustrates this structure, where 'α' represents a calculated, non-zero attention weight and '0' represents a weight that has been forcibly set to zero?
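As a quick illustrative sketch (the random scores here merely stand in for QK^T / sqrt(d)), the masked softmax always produces a lower-triangular weight matrix whose rows sum to 1:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # stand-in for QK^T / sqrt(d)

# Additive causal mask: -inf strictly above the diagonal.
mask = np.triu(np.full((4, 4), -np.inf), k=1)

weights = np.exp(scores + mask)
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 3))  # entries above the diagonal are exactly 0
print(weights.sum(axis=-1))  # each row sums to 1
```

Every entry above the diagonal is forced to zero, so row i distributes its attention only over columns 0 through i; this is exactly the lower-triangular 'α'/'0' pattern the question describes.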