
Linear Relative Position Bias Example

A linear relative position bias scheme incorporates sequence order into attention mechanisms by adding a penalty term, calculated as $-\beta(i-j)$, to the query-key dot product. In this formula, $(i-j)$ is the relative distance between the query at position $i$ and the key at position $j$, and $\beta$ is a scalar, so the penalty grows linearly with distance. In a causal attention setting, where a query only attends to previous keys, the bias values for different maximum relative distances are as follows:

  • For relative distances of 3, 2, 1, and 0, the biases are: $-3\beta, -2\beta, -\beta, 0$
  • For relative distances of 4, 3, 2, 1, and 0, the biases are: $-4\beta, -3\beta, -2\beta, -\beta, 0$
  • For relative distances of 5, 4, 3, 2, 1, and 0, the biases are: $-5\beta, -4\beta, -3\beta, -2\beta, -\beta, 0$
  • For relative distances of 6, 5, 4, 3, 2, 1, and 0, the biases are: $-6\beta, -5\beta, -4\beta, -3\beta, -2\beta, -\beta, 0$

This pattern shows that the bias is zero for self-attention (when $i=j$) and that the penalty grows more negative as the query and key positions move further apart.
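
A minimal NumPy sketch of this scheme is shown below (the function name `linear_relative_bias` and the default `beta=1.0` are illustrative assumptions, not taken from the source). It builds the bias matrix that would be added to the query-key scores in causal attention; for a sequence of length 4, the last row reproduces the first bullet above.

```python
import numpy as np

def linear_relative_bias(seq_len, beta=1.0):
    """Causal linear relative position bias matrix (a sketch, not from the source).

    Entry (i, j) holds -beta * (i - j) when key j is at or before query i,
    and -inf when j > i so that future positions are masked out.
    """
    i = np.arange(seq_len)[:, None]          # query positions, as a column
    j = np.arange(seq_len)[None, :]          # key positions, as a row
    bias = (-beta * (i - j)).astype(float)   # linear penalty, 0 on the diagonal
    bias[j > i] = -np.inf                    # causal mask: no attention to future keys
    return bias

# For seq_len = 4 the last row is [-3*beta, -2*beta, -beta, 0],
# matching the first bullet in the list above.
print(linear_relative_bias(4, beta=1.0))
```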
