Case Study

Calculating Pre-Normalized Attention Scores

A language model is computing attention scores to determine how much two previous tokens (at positions j=2 and j=4) influence the token currently being generated (at position i=5). The pre-normalized score for each position is the sum of a query-key similarity value and a relative positional encoding value. Based on the data provided in the case study, which of the two previous tokens receives the higher attention score? Justify your answer by calculating the pre-normalized score for each position.
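The comparison described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the case study's solution: the function names are hypothetical, and the numeric values are placeholders chosen for the example because the case-study data is not reproduced here. Substitute the provided similarity and positional values to obtain the actual answer.

```python
def pre_norm_score(similarity: float, positional_term: float) -> float:
    """Pre-normalized score: query-key similarity plus relative positional term."""
    return similarity + positional_term


def compare_positions(similarities: dict, positional_terms: dict):
    """Return the position j with the highest pre-normalized score, plus all scores."""
    scores = {
        j: pre_norm_score(similarities[j], positional_terms[j])
        for j in similarities
    }
    best = max(scores, key=scores.get)
    return best, scores


# PLACEHOLDER values (assumptions for illustration, NOT the case-study data).
# Keys are the previous-token positions j; the current token is at i = 5,
# so the relative offsets i - j are 3 (for j=2) and 1 (for j=4).
similarities = {2: 2.0, 4: 1.0}
positional_terms = {2: -1.5, 4: -0.25}

best, scores = compare_positions(similarities, positional_terms)
print(scores)  # {2: 0.5, 4: 0.75}
print(best)    # 4
```

With these placeholder numbers, position j=2 has the larger similarity, yet position j=4 ends up with the higher score once the positional term is added, which is the kind of trade-off the question asks you to check by hand.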

Updated 2025-10-03

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Application in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science