Case Study

Impact of Masking on Attention Weight Distribution

A model is processing a sequence of three input tokens. For the query vector at position 2, the unnormalized attention scores with respect to the key vectors at positions 0, 1, and 2 are [2.5, 1.0, 4.0]. Now imagine a mechanism that prevents the query at position 2 from attending to itself: a very large negative number (e.g., -10,000) is added to the score for position 2 before the softmax normalization that converts scores into weights. Analyze and explain the resulting effect on the final, normalized attention weights for the query at position 2.
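
As a concrete check on the numbers, here is a minimal NumPy sketch of the computation described above. The score vector and the -10,000 mask value come directly from the prompt; the softmax helper is the standard choice for the normalization step, and the printed weights are rounded approximations.

```python
import numpy as np

# Unnormalized attention scores for the query at position 2,
# with respect to keys at positions 0, 1, and 2 (from the case study).
scores = np.array([2.5, 1.0, 4.0])

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Without masking, position 2 (score 4.0) dominates.
print("unmasked:", softmax(scores))
# -> approximately [0.175, 0.039, 0.786]

# Mask out position 2 by adding a very large negative number to its score.
mask = np.array([0.0, 0.0, -10_000.0])
print("masked:  ", softmax(scores + mask))
# -> approximately [0.818, 0.182, 0.0]
# exp of a hugely negative score underflows to 0, so the weight at
# position 2 collapses to (effectively) zero, and the remaining
# probability mass is redistributed over positions 0 and 1 in
# proportion to their exponentiated scores.
```

Adding the large negative value before the softmax (rather than zeroing a weight afterwards) is the usual design choice, because the output then still sums to 1 without any renormalization step.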
