
Calculating Attention Weights (αi,j) in Transformers

The attention weight, denoted $\alpha_{i,j}$, quantifies the relevance of position $j$ to position $i$. In Transformer models, this weight is obtained by applying a normalization function (typically the Softmax) to the attention score $\beta_{i,j}$. The attention score itself is the scaled dot product of the query vector $\mathbf{q}_i$ and the key vector $\mathbf{k}_j$, optionally with an additive mask.
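The computation described above can be sketched in NumPy. This is a minimal illustration, not a production implementation: the function name `attention_weights` is illustrative, and it assumes scores $\beta_{i,j} = \mathbf{q}_i \cdot \mathbf{k}_j / \sqrt{d}$ (with an optional additive mask) normalized row-wise by Softmax to give $\alpha_{i,j}$.

```python
import numpy as np

def attention_weights(Q, K, mask=None):
    """Compute attention weights alpha from queries Q and keys K.

    Q, K: arrays of shape (n, d). mask (optional): additive mask of
    shape (n, n), e.g. -inf at disallowed positions (causal masking).
    Returns alpha of shape (n, n), each row summing to 1.
    """
    d = Q.shape[-1]
    # Attention scores beta: scaled dot product q_i . k_j / sqrt(d)
    beta = Q @ K.T / np.sqrt(d)
    if mask is not None:
        beta = beta + mask
    # Softmax over j (subtract the row max for numerical stability)
    beta = beta - beta.max(axis=-1, keepdims=True)
    exp_beta = np.exp(beta)
    return exp_beta / exp_beta.sum(axis=-1, keepdims=True)
```

With a causal mask (upper triangle set to $-\infty$), each position $i$ attends only to positions $j \le i$, and the resulting weight matrix is lower-triangular.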

Updated 2026-04-22

Tags

Ch.2 Generative Models - Foundations of Large Language Models


Foundations of Large Language Models Course

Computing Sciences