Formula

Self-Attention Output Formula

Given a sequence of input tokens $\mathbf{x}_1, \ldots, \mathbf{x}_n$ where each token $\mathbf{x}_i \in \mathbb{R}^d$ for $1 \leq i \leq n$, the self-attention mechanism produces an output sequence of the same length, denoted as $\mathbf{y}_1, \ldots, \mathbf{y}_n$. Each output vector $\mathbf{y}_i$ is computed by treating the token $\mathbf{x}_i$ as the query, and the entire sequence of tokens as both the keys and the values. This is mathematically defined as:

$$\mathbf{y}_i = f(\mathbf{x}_i, (\mathbf{x}_1, \mathbf{x}_1), \ldots, (\mathbf{x}_n, \mathbf{x}_n)) \in \mathbb{R}^d$$

where $f$ represents the general attention pooling function.
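To make the definition concrete, here is a minimal sketch in plain Python. The formula leaves the attention pooling function unspecified, so this example assumes scaled dot-product attention with no learned projections; the names `softmax` and `self_attention` are illustrative only and do not come from the source.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(xs):
    """Return y_1..y_n with y_i = f(x_i, (x_1, x_1), ..., (x_n, x_n)).

    Assumption: f is scaled dot-product attention pooling without
    learned query/key/value projections. The formula itself does not
    fix a particular f.
    """
    d = len(xs[0])
    ys = []
    for query in xs:
        # Score the query against every key; the keys are the tokens themselves.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in xs
        ]
        weights = softmax(scores)
        # Weighted average of the values, which are also the tokens themselves.
        y = [sum(w * value[j] for w, value in zip(weights, xs)) for j in range(d)]
        ys.append(y)
    return ys


xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # n = 3 tokens, d = 2
ys = self_attention(xs)  # n output vectors, each in R^d
```

The output has one vector per input token, and each output lies in the same space as the inputs, matching the claim that the sequence length and dimension are preserved. Under this choice of pooling function, every output is a convex combination of the input tokens.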

Updated 2026-05-15

Tags

D2L

Dive into Deep Learning @ D2L