Case Study

Analyzing Attention Weight Redistribution

An engineer is analyzing a language model's attention mechanism. For a specific query token, the model initially calculates the following normalized attention weights (dense attention) over four previous tokens:

  • Token 1: 0.1
  • Token 2: 0.5
  • Token 3: 0.3
  • Token 4: 0.1

To improve efficiency, the engineer modifies the mechanism to attend only to Token 2 and Token 3, ignoring Tokens 1 and 4. The raw attention scores for Tokens 2 and 3 are unchanged, but the softmax is re-normalized over this smaller set.

Predict the effect of this change on the attention weight for Token 2. Will the new, sparse attention weight for Token 2 be greater than, less than, or equal to its original weight of 0.5? Justify your answer based on the principle of probability mass distribution.
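The prediction can be checked numerically. The sketch below (a minimal illustration; variable names are ours, not from the case study) renormalizes the surviving weights so they again sum to 1, which is equivalent to re-running the softmax over the smaller set when the raw scores are unchanged:

```python
# Original dense attention weights over the four previous tokens
# (values taken from the case study above).
dense = [0.1, 0.5, 0.3, 0.1]

# Sparse attention keeps only Token 2 and Token 3 (indices 1 and 2).
# The probability mass of the dropped tokens (0.1 + 0.1 = 0.2) must be
# redistributed, so each kept weight is divided by the surviving mass.
kept = [dense[1], dense[2]]          # [0.5, 0.3]
total = sum(kept)                    # 0.8
sparse = [w / total for w in kept]   # [0.625, 0.375]

print(f"Token 2: {sparse[0]:.3f}, Token 3: {sparse[1]:.3f}")
```

Because Token 2's weight is divided by a total less than 1, its renormalized weight (0.5 / 0.8 = 0.625) is greater than its original 0.5: the mass freed by excluding Tokens 1 and 4 is redistributed proportionally to the tokens that remain.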

Updated 2025-10-10

Tags

  • Ch.2 Generative Models - Foundations of Large Language Models
  • Application in Bloom's Taxonomy
  • Cognitive Psychology