Formula

Causal Attention Weight Matrix Calculation

In a causal attention mechanism, the attention weight matrix, denoted $\alpha(\mathbf{Q}, \mathbf{K})$, is computed as

$$
\alpha(\mathbf{Q}, \mathbf{K}) = \text{Softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\text{T}}}{\sqrt{d}} + \text{Mask}\right)
$$

This operation yields a lower-triangular matrix of size $m \times m$, where $m$ is the sequence length. The mask ensures that any element $\alpha_{i,j}$ with $j > i$ is zero, preventing any position from attending to future positions. Each row vector $(\alpha_{i,0}, \dots, \alpha_{i,i}, 0, \dots, 0)$ is the probability distribution of attention for the $i$-th token over itself and all preceding tokens. The structure of the matrix is

$$
\alpha(\mathbf{Q}, \mathbf{K}) =
\begin{bmatrix}
\alpha_{0,0} & 0 & 0 & \dots & 0 \\
\alpha_{1,0} & \alpha_{1,1} & 0 & \dots & 0 \\
\alpha_{2,0} & \alpha_{2,1} & \alpha_{2,2} & \dots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\alpha_{m-1,0} & \alpha_{m-1,1} & \alpha_{m-1,2} & \dots & \alpha_{m-1,m-1}
\end{bmatrix}
$$
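
A minimal NumPy sketch of this computation is given below (the function name causal_attention_weights and the example shapes are ours; Q, K, d, and m follow the formula above). The mask is applied as additive $-\infty$ above the diagonal before the softmax, which is equivalent to zeroing every weight with $j > i$:

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Causal attention weights alpha(Q, K) for queries/keys of shape (m, d)."""
    m, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                  # (m, m) scaled dot products
    mask = np.triu(np.full((m, m), -np.inf), k=1)  # -inf where j > i, 0 elsewhere
    scores = scores + mask
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)                       # exp(-inf) = 0 above the diagonal
    return weights / weights.sum(axis=-1, keepdims=True)

# Example: m = 4 tokens, d = 8 dimensions
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
alpha = causal_attention_weights(Q, K)
print(np.round(alpha, 3))  # lower triangular; every row sums to 1
```

The printed matrix is lower triangular, and each row is a valid probability distribution, matching the structure shown above.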
