Calculating Pre-Normalized Attention Scores
A language model is calculating attention scores to determine the influence of two previous tokens (at positions j=2 and j=4) on the current token being generated (at position i=5). The score before normalization is calculated by adding a query-key similarity value to a relative positional encoding value. Based on the data provided in the case study, which of the two previous tokens will receive a higher attention score? Justify your answer by calculating the pre-normalized score for each position.
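The case-study numbers are not reproduced here, but the calculation itself can be sketched. The snippet below is a minimal illustration, assuming made-up query/key vectors and a hypothetical relative-position bias table `bias` (larger bias for closer tokens); it is not the case-study data.

```python
import numpy as np

# Pre-normalized attention score for current position i attending to an
# earlier position j:
#   score(i, j) = q_i . k_j + b[i - j]
# All vectors and bias values below are invented for illustration.

d = 4                                  # embedding dimension (assumed)
rng = np.random.default_rng(0)
q5 = rng.normal(size=d)                # query vector for current position i = 5
k2 = rng.normal(size=d)                # key vector for position j = 2
k4 = rng.normal(size=d)                # key vector for position j = 4

# Hypothetical relative-position bias, indexed by distance i - j.
bias = {1: 0.5, 2: 0.3, 3: 0.1}

def score(q, k, distance):
    """Pre-normalized score: query-key similarity plus positional bias."""
    return float(q @ k) + bias[distance]

s2 = score(q5, k2, 5 - 2)              # distance 3 -> bias 0.1
s4 = score(q5, k4, 5 - 4)              # distance 1 -> bias 0.5
winner = 2 if s2 > s4 else 4
print(f"score(i=5, j=2) = {s2:.3f}")
print(f"score(i=5, j=4) = {s4:.3f}")
print(f"higher attention goes to position j={winner}")
```

Whichever position has the larger sum of similarity and positional bias receives the higher pre-normalized score; with the actual case-study values, the same two-term addition decides the answer.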
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Consider the calculation of an attention weight, which determines the influence of an input at position j on the output at a later position i. The calculation is based on a formula that includes: 1) a similarity score between vectors from positions i and j, 2) a term that depends on the relative distance between i and j, and 3) a masking component that prevents attending to positions k where k > i. If the term that depends on the relative distance were removed from this calculation, what would be the primary consequence?
Calculating Pre-Normalized Attention Scores
In a causal attention mechanism that incorporates relative positional information, consider the calculation of attention for an output at position i. If the dot product of the query vector from position i with the key vector from position j is identical to its dot product with the key vector from position k (where j ≠ k, and both j, k < i), then the final attention weights assigned to positions j and k will also be identical.
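The setup in this related question can be made concrete with a small numeric sketch. The numbers below are invented for illustration: both earlier positions get the same query-key dot product, but their relative-position biases (a hypothetical `bias` table) differ, so the pre-softmax scores, and hence the normalized weights, differ.

```python
import math

# Hypothetical illustration: identical query-key dot products at two earlier
# positions, but different relative-position biases. All values are made up.
i = 5
positions = {"j": 4, "k": 2}           # two earlier positions, both < i
dot = 2.0                              # identical query-key dot product (assumed)
bias = {1: 0.5, 2: 0.3, 3: 0.1}        # hypothetical relative-position biases

# Pre-normalized scores: same similarity term, different positional term.
scores = {name: dot + bias[i - p] for name, p in positions.items()}
print(scores)                          # prints {'j': 2.5, 'k': 2.1}

# Softmax over the two scores: the weights differ because the biases differ,
# even though the dot products are identical.
z = sum(math.exp(s) for s in scores.values())
weights = {name: math.exp(s) / z for name, s in scores.items()}
print(weights)
```

Under these assumptions the nearer position receives the larger weight, which shows why identical dot products alone do not force identical attention weights once a relative-position term is added.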