Case Study

Impact of Masking on Attention Weight Distribution

A model is processing a sequence of three input tokens. For the query vector at position 2, the unnormalized attention scores with respect to the key vectors at positions 0, 1, and 2 are [2.5, 1.0, 4.0]. Now imagine a mechanism that prevents the query at position 2 from attending to itself: a very large negative number (e.g., -10,000) is added to the score for position 2 before the softmax normalization that converts scores into weights. Analyze and explain the resulting effect on the final, normalized attention weights for the query at position 2.
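
As a concrete check on the numbers, here is a minimal NumPy sketch of the computation described above. The score vector and the -10,000 mask value come directly from the prompt; the softmax helper is the standard choice for the normalization step, and the printed weights are rounded approximations.

```python
import numpy as np

# Unnormalized attention scores for the query at position 2,
# with respect to keys at positions 0, 1, and 2 (from the case study).
scores = np.array([2.5, 1.0, 4.0])

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Without masking, position 2 (score 4.0) dominates.
print("unmasked:", softmax(scores))
# -> approximately [0.175, 0.039, 0.786]

# Mask out position 2 by adding a very large negative number to its score.
mask = np.array([0.0, 0.0, -10_000.0])
print("masked:  ", softmax(scores + mask))
# -> approximately [0.818, 0.182, 0.0]
# exp of a hugely negative score underflows to 0, so the weight at
# position 2 collapses to (effectively) zero, and the remaining
# probability mass is redistributed over positions 0 and 1 in
# proportion to their exponentiated scores.
```

Adding the large negative value before the softmax (rather than zeroing a weight afterwards) is the usual design choice, because the output then still sums to 1 without any renormalization step.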
