Causal Attention Weight Matrix Calculation
In a causal attention mechanism, the attention weight matrix, denoted as α, is computed using the formula:

α = Softmax(QK^T / sqrt(d) + Mask)

This operation yields a lower triangular matrix of size m x m, where m is the sequence length. The Mask ensures that any element α_{i,j} is zero if j > i, preventing any position from attending to future positions. Each row vector in this matrix, such as α_i, represents the probability distribution of attention for the i-th token over itself and all preceding tokens in the sequence. The structure of this matrix is as follows:

α_{1,1}   0         0         ...   0
α_{2,1}   α_{2,2}   0         ...   0
α_{3,1}   α_{3,2}   α_{3,3}   ...   0
...       ...       ...       ...   ...
α_{m,1}   α_{m,2}   α_{m,3}   ...   α_{m,m}
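As a rough illustration, the following NumPy sketch computes this weight matrix; the shapes, the random inputs, and the helper name causal_attention_weights are illustrative assumptions rather than part of the original material.

import numpy as np

def causal_attention_weights(Q, K):
    # Q, K: (m, d) matrices of Query and Key vectors for a sequence of m tokens.
    m, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                  # raw scores, shape (m, m)
    mask = np.triu(np.full((m, m), -np.inf), k=1)  # -inf above the diagonal blocks future positions
    masked = scores + mask
    # Row-wise Softmax: each row becomes a probability distribution.
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights                                 # weights[i, j] == 0 for j > i

rng = np.random.default_rng(0)
alpha = causal_attention_weights(rng.normal(size=(5, 8)), rng.normal(size=(5, 8)))
print(np.round(alpha, 2))

Printing alpha shows zeros above the diagonal and rows that each sum to 1, matching the lower triangular structure described above.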
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Causal Attention Weight Matrix Calculation
An attention mechanism processes the input sequence: ['The', 'robot', 'grasped', 'the', 'wrench']. The attention weight matrix is calculated to determine the contextual importance of each word. The row in the matrix corresponding to the word 'grasped' has the highest weight value in the column corresponding to the word 'wrench'. What does this high weight signify?
Interpreting an Attention Weight Matrix
In an attention mechanism processing a sequence of m items, an m x m attention weight matrix is generated. What does the i-th row of this matrix fundamentally represent?
Query-Key-Value Attention Output Matrix Product
Causal Attention Input Structure
Causal Attention Mask Matrix Definition
Causal Attention Weight Matrix Calculation
An engineer is implementing an attention mechanism where the output is a weighted sum of Value vectors, with weights determined by a Softmax function applied to scores. They observe that as the dimension (d) of the Query and Key vectors increases, the attention weights become extremely concentrated on a single position (e.g., [0.01, 0.98, 0.01]), causing training instability. The scores are derived from the dot product of Query (Q) and Key (K) matrices. What is the most likely cause of this issue? (A short numerical sketch of this scaling effect follows this list.)
Attention Mechanism Misapplication in Summarization
Analyzing the Role of the Mask in Attention
Selecting an Attention Design for Long-Context, Low-Latency Inference
Diagnosing and Redesigning Attention for a Long-Context, Cost-Constrained LLM Service
Choosing an Attention Stack for a Regulated, Long-Document Review Assistant
Attention Redesign for a Long-Context Customer-Support Copilot Under GPU Memory Pressure
Attention Redesign for a Multi-Tenant LLM with Long Context and Strict KV-Cache Budgets
Attention Architecture Choice for On-Device Meeting Summarization with 60k Context
You’re debugging an LLM inference service that mus...
You’re reviewing a design doc for a Transformer at...
Your team is deploying a chat-based LLM that must ...
You’re leading an LLM platform team that must supp...
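For the engineer's scaling question above, a brief numerical sketch (the dimensions, random values, and softmax helper are assumptions for illustration) shows why unscaled dot-product scores saturate the Softmax as d grows, and how dividing by sqrt(d) keeps the weights spread out:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
for d in (4, 64, 1024):
    q = rng.normal(size=d)            # one Query vector
    K = rng.normal(size=(3, d))       # Keys for three candidate positions
    raw = K @ q                       # dot-product scores; variance grows roughly with d
    scaled = raw / np.sqrt(d)         # the 1/sqrt(d) factor keeps the variance near 1
    print(d, np.round(softmax(raw), 3), np.round(softmax(scaled), 3))

As d increases, the unscaled weights collapse onto a single position while the scaled weights stay comparatively smooth, which is the behavior the question describes.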
Learn After
Role of Causal Attention in Autoregressive Language Models
Causal Attention Output for a Single Token
Visualization of Query-Key Dot Products in Causal Attention
An autoregressive model calculates a square attention weight matrix using the formula: Softmax((QK^T / sqrt(d)) + Mask). The purpose of the Mask component is to prevent any token from attending to subsequent tokens in the sequence. Which statement best describes the resulting attention weight matrix?
An autoregressive model is processing a sequence of 4 tokens. To ensure that the prediction for any given token is based only on the tokens that came before it and the token itself, a specific structure is imposed on the attention weight matrix. Which of the following 4x4 matrices correctly illustrates this structure, where 'α' represents a calculated, non-zero attention weight and '0' represents a weight that has been forcibly set to zero? (A small sketch of this mask and the resulting pattern follows this list.)
Applying a Causal Mask to Attention Scores
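As referenced above, here is a minimal sketch of the Mask for a 4-token sequence; the -inf convention (in practice a large negative number added above the diagonal) and the 4x4 size are illustrative assumptions.

import numpy as np

m = 4
mask = np.triu(np.full((m, m), -np.inf), k=1)  # 0 on and below the diagonal, -inf above it
print(mask)
# After the Softmax, every -inf entry becomes exactly 0, so the weight matrix
# takes the lower triangular α/0 pattern the question describes:
print(np.array([['α' if j <= i else '0' for j in range(m)] for i in range(m)]))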