Formula

Complete ALiBi Attention Formula

The final attention weight in the ALiBi framework, denoted $\alpha(i, j)$, is computed by applying the Softmax function to the attention score. This score is derived by adding the ALiBi positional bias term $\beta \cdot (j - i)$ to the standard query-key product $\mathbf{q}_i \mathbf{k}_j^{\mathrm{T}}$, scaling the sum by the inverse square root of the dimension $d$, and incorporating an optional mask. The complete equation is:

$$\alpha(i,j) = \mathrm{Softmax}\left(\frac{\mathbf{q}_i \mathbf{k}_j^{\mathrm{T}} + \beta \cdot (j - i)}{\sqrt{d}} + \mathrm{Mask}(i,j)\right)$$

In this formula, $\mathbf{q}_i$ and $\mathbf{k}_j$ denote the query and key vectors, and $\beta$ acts as a scaling factor. The $\mathrm{Mask}(i, j)$ term ensures proper attention masking when required.
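As a minimal sketch of how this formula can be evaluated, the NumPy function below computes the full matrix of weights $\alpha(i, j)$ for a single attention head. The function name `alibi_attention`, the scalar slope argument `beta`, and the `causal` option are illustrative assumptions, not an implementation from the source; the code follows the formula above, adding the bias before the $1/\sqrt{d}$ scaling and realizing $\mathrm{Mask}(i, j)$ as a causal mask that is $-\infty$ for future positions $j > i$ and $0$ elsewhere.

```python
import numpy as np

def alibi_attention(Q, K, beta, causal=True):
    """Attention weights alpha(i, j) for one head, per the formula above.

    Q, K: (seq_len, d) arrays of query and key vectors.
    beta: scalar ALiBi slope (hypothetical parameter name).
    """
    seq_len, d = Q.shape
    # Index grids: i indexes queries (rows), j indexes keys (columns).
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    # Raw scores plus positional bias beta * (j - i), then 1/sqrt(d) scaling,
    # matching the placement of the bias inside the scaling in the formula.
    scores = (Q @ K.T + beta * (j - i)) / np.sqrt(d)
    if causal:
        # Mask(i, j): -inf on future positions j > i, 0 elsewhere,
        # so those entries become exactly 0 after the softmax.
        scores = np.where(j > i, -np.inf, scores)
    # Numerically stable row-wise softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Tiny usage check: each row of attention weights should sum to 1.
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 16))
K = rng.normal(size=(5, 16))
alpha = alibi_attention(Q, K, beta=0.5)
print(alpha.sum(axis=-1))  # -> [1. 1. 1. 1. 1.]
```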


