Learn Before
Calculating a Pre-Softmax Attention Score with Positional Bias
Based on the provided scenario and formula, calculate the final pre-Softmax attention score. Break down your calculation to show how each component (scaled dot product, positional bias, and mask) contributes to the final result.
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A language model computes its pre-normalized attention scores using the formula:
Score = (query_vector ⋅ key_vector + β ⋅ (key_position − query_position)) / sqrt(dimension). In this model, the scalar hyperparameter β is a fixed negative number. Consider a query token at position i = 10. How does the bias term β ⋅ (key_position − query_position) influence the scores for a key token at position j = 12 compared to a key token at position j = 20, assuming all other components of the score are equal for both keys?
Calculating a Pre-Softmax Attention Score with Positional Bias
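A minimal Python sketch of this comparison, using assumed values (β = −0.5, a shared dot product of 4.0, and dimension d = 64 are illustrative choices, not values from the question). With a negative β, the bias grows more negative as j − i grows, so the more distant key at j = 20 is penalized more than the nearer key at j = 12:

```python
import math

beta = -0.5   # fixed negative slope (assumed value)
qk = 4.0      # query·key dot product, equal for both keys (assumed value)
d = 64        # head dimension (assumed value)

def score(i, j):
    # Score = (q·k + β·(j − i)) / sqrt(d), per the formula above
    return (qk + beta * (j - i)) / math.sqrt(d)

i = 10
s12 = score(i, 12)  # bias contribution: -0.5 * 2  = -1.0
s20 = score(i, 20)  # bias contribution: -0.5 * 10 = -5.0
print(s12, s20)     # 0.375 -0.125 — the nearer key scores higher
```

Because all other components are equal, the gap between the two scores comes entirely from the bias term: β ⋅ 2 versus β ⋅ 10.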
In a language model using the complete ALiBi attention formula for causal text generation, the model needs to prevent a query token at position i from attending to any key token at a future position j (where j > i). How does the Mask(i, j) term within the formula α(i, j) = Softmax((q_iᵀk_j + β⋅(j−i))/√d + Mask(i, j)) achieve this?
Modeling Arbitrarily Long Sequences with ALiBi
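A small Python sketch of the masking mechanism, with assumed pre-mask scores (the values in `raw` are illustrative). Setting Mask(i, j) = −∞ for j > i drives exp(−∞) to 0 inside the Softmax, so future positions receive exactly zero attention weight:

```python
import math

def mask(i, j):
    # Causal mask: -inf for future positions (j > i), 0 otherwise
    return float("-inf") if j > i else 0.0

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) evaluates to 0.0
    s = sum(exps)
    return [e / s for e in exps]

# Assumed pre-mask scores for a query at i = 2 over keys j = 0..4
i = 2
raw = [1.0, 2.0, 1.5, 3.0, 2.5]
masked = [s + mask(i, j) for j, s in enumerate(raw)]
weights = softmax(masked)
print(weights)  # positions j = 3 and j = 4 get weight 0
```

Note that j = 3 has the largest raw score, yet its weight is zero: the mask overrides score magnitude entirely for future positions.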
Tuning the ALiBi Bias Scalar (β)