Case Study

Applying a Causal Mask to Attention Scores

A language model is processing a sequence of 4 tokens. After calculating the scaled dot-product scores between queries and keys, it produces the following 4x4 matrix of pre-softmax attention scores:

[[2.1, 1.5, 0.8, 1.2],
 [1.8, 2.5, 1.1, 0.9],
 [0.7, 1.3, 2.2, 1.9],
 [1.4, 0.6, 1.7, 2.8]]

To ensure that the model only attends to the current and earlier tokens when predicting the next one, a causal mask is added to this matrix before the softmax normalization step. Based on this process, what will be the value of the element at row index 1, column index 2 (currently 1.1) after the causal mask is applied, and why?

Answer: 0. The element at row index 1, column index 2 corresponds to the token at position 1 attending to the token at position 2, which lies in the future. The causal mask sets that score to −∞ (in practice, a very large negative number), and softmax maps −∞ to an attention weight of exactly 0, so the future token contributes nothing.
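The masking step can be checked with a short sketch in plain Python (no external libraries; the matrix is the one from the case study):

```python
import math

# Pre-softmax attention scores from the case study (4 tokens).
scores = [
    [2.1, 1.5, 0.8, 1.2],
    [1.8, 2.5, 1.1, 0.9],
    [0.7, 1.3, 2.2, 1.9],
    [1.4, 0.6, 1.7, 2.8],
]

def causal_mask(s):
    """Set scores above the diagonal (future positions) to -inf."""
    n = len(s)
    return [[s[i][j] if j <= i else float("-inf") for j in range(n)]
            for i in range(n)]

def softmax(row):
    """Numerically stable row-wise softmax."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

masked = causal_mask(scores)
weights = [softmax(row) for row in masked]

print(masked[1][2])   # -inf: position 1 may not see position 2
print(weights[1][2])  # 0.0 after softmax
```

Note that each masked row still sums to 1 after softmax, so the probability mass is redistributed over the visible (current and past) positions only.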


Updated 2025-10-08


Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Application in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science