Example

Visual Example of a Linear Relative Position Bias in Causal Attention

In causal self-attention, a linear relative position bias is applied to penalize attention to distant past tokens. The bias for a query at position $i$ and a key at position $j$ is calculated as $-\beta(i - j)$, where $\beta$ is a scalar parameter. This bias is only applied to valid query-key pairs where $j \le i$, enforcing causality. For example, the set of computed query-key dot products for a sequence of length 7 (indexed 0-6) would form a lower-triangular structure: $\mathbf{q}_0\mathbf{k}_0^{\mathrm{T}}$; $\mathbf{q}_1\mathbf{k}_0^{\mathrm{T}}, \mathbf{q}_1\mathbf{k}_1^{\mathrm{T}}$; $\ldots$; $\mathbf{q}_6\mathbf{k}_0^{\mathrm{T}}, \ldots, \mathbf{q}_6\mathbf{k}_6^{\mathrm{T}}$. The bias added to each of these dot products would be zero for self-attention (e.g., $\mathbf{q}_2\mathbf{k}_2^{\mathrm{T}}$) and become increasingly negative for more distant pairs (e.g., the bias for $\mathbf{q}_6\mathbf{k}_0^{\mathrm{T}}$ would be more negative than for $\mathbf{q}_6\mathbf{k}_5^{\mathrm{T}}$).
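The sketch below illustrates this idea, assuming a single attention head and unbatched tensors; the function names, the choice of `beta = 1.0`, and the use of PyTorch are illustrative assumptions rather than details from the text. It builds the $-\beta(i - j)$ bias matrix, masks future positions (where $j > i$) with $-\infty$ so they receive zero attention weight, and adds the bias to the scaled query-key scores.

```python
# Minimal sketch of a linear relative position bias in causal attention.
# Sequence length, beta, and tensor shapes here are illustrative assumptions.
import torch

def linear_position_bias(seq_len: int, beta: float) -> torch.Tensor:
    """Return a (seq_len, seq_len) matrix with -beta * (i - j) for j <= i and -inf for j > i."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions, shape (seq_len, 1)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions, shape (1, seq_len)
    bias = -beta * (i - j).float()           # 0 on the diagonal, more negative for distant past keys
    causal_mask = j > i                      # future positions are not valid in causal attention
    return bias.masked_fill(causal_mask, float("-inf"))

def biased_causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, beta: float) -> torch.Tensor:
    """q, k, v: (seq_len, d). Adds the linear position bias to the scaled dot-product scores."""
    seq_len, d = q.shape
    scores = q @ k.T / d ** 0.5                          # query-key dot products (lower triangle is the valid part)
    scores = scores + linear_position_bias(seq_len, beta)
    return torch.softmax(scores, dim=-1) @ v

# Sequence of length 7 (positions 0-6), as in the example above:
# the diagonal is 0, and row 6 reads -6, -5, ..., -1, 0 before the softmax.
print(linear_position_bias(7, beta=1.0))
```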
