1Cademy - Additive Attention Scoring Function

Learn Before

Attention Scoring Functions

Formula

Additive Attention Scoring Function

Additive attention is a scoring function designed for situations where queries and keys reside in vector spaces of differing dimensionality, making a direct dot product infeasible. Introduced by Bahdanau et al. (2014), it projects both the query $\mathbf{q} \in \mathbb{R}^q$ and the key $\mathbf{k} \in \mathbb{R}^k$ into a shared hidden space of dimension $h$ using separate learned weight matrices $\mathbf{W}_q \in \mathbb{R}^{h \times q}$ and $\mathbf{W}_k \in \mathbb{R}^{h \times k}$. These projections are summed element-wise and passed through a $\tanh$ nonlinearity, after which a learned weight vector $\mathbf{w}_v \in \mathbb{R}^h$ reduces the result to a scalar attention score:

$a(\mathbf{q}, \mathbf{k}) = \mathbf{w}_v^\top \tanh(\mathbf{W}_q \mathbf{q} + \mathbf{W}_k \mathbf{k}) \in \mathbb{R}$
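
The formula can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `additive_score`, the dimensions, and the randomly drawn weights (which would be learned in practice) are all assumptions made for the example.

```python
import numpy as np

def additive_score(q, k, W_q, W_k, w_v):
    """Return the scalar score w_v^T tanh(W_q q + W_k k)."""
    # Project query and key into the shared h-dimensional space and sum them.
    hidden = np.tanh(W_q @ q + W_k @ k)  # shape (h,)
    # Reduce the hidden vector to a single scalar.
    return float(w_v @ hidden)

# Illustrative sizes (assumed): query dim 4, key dim 6, hidden dim 8.
rng = np.random.default_rng(0)
q_dim, k_dim, h = 4, 6, 8
q = rng.normal(size=q_dim)
k = rng.normal(size=k_dim)
W_q = rng.normal(size=(h, q_dim))  # stands in for a learned matrix
W_k = rng.normal(size=(h, k_dim))  # stands in for a learned matrix
w_v = rng.normal(size=h)           # stands in for a learned vector

score = additive_score(q, k, W_q, W_k, w_v)
```

Because $\tanh$ is bounded in $[-1, 1]$, the magnitude of the score cannot exceed the sum of the absolute values of the entries of $\mathbf{w}_v$, regardless of how large the query and key are.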

The scores computed for all keys are then passed through a softmax function to produce nonnegative, normalized attention weights. An equivalent interpretation views additive attention as concatenating the query and key and feeding the result through an MLP with a single hidden layer and $\tanh$ activation. As its name suggests, the computation is additive rather than multiplicative, which can yield minor computational savings compared with scoring functions that multiply queries and keys directly.
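
Both the softmax step and the concatenation view can be checked numerically. The sketch below assumes one query scored against `n` keys stacked in a matrix `K`; the helper names (`additive_scores`, `softmax`, `concat_mlp_scores`) and the random weights are hypothetical and chosen only for illustration.

```python
import numpy as np

def additive_scores(q, K, W_q, W_k, w_v):
    """Scores of one query against n keys; K has shape (n, k_dim)."""
    # W_q @ q has shape (h,) and broadcasts against K @ W_k.T of shape (n, h).
    hidden = np.tanh(W_q @ q + K @ W_k.T)
    return hidden @ w_v  # shape (n,)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def concat_mlp_scores(q, K, W_q, W_k, w_v):
    """Same scores via a one-hidden-layer MLP applied to [q; k]."""
    W = np.concatenate([W_q, W_k], axis=1)         # (h, q_dim + k_dim)
    q_rep = np.tile(q, (K.shape[0], 1))            # (n, q_dim)
    qk = np.concatenate([q_rep, K], axis=1)        # (n, q_dim + k_dim)
    return np.tanh(qk @ W.T) @ w_v                 # (n,)

# Illustrative sizes (assumed): 5 keys, query dim 4, key dim 6, hidden dim 8.
rng = np.random.default_rng(1)
n, q_dim, k_dim, h = 5, 4, 6, 8
q = rng.normal(size=q_dim)
K = rng.normal(size=(n, k_dim))
W_q = rng.normal(size=(h, q_dim))
W_k = rng.normal(size=(h, k_dim))
w_v = rng.normal(size=h)

scores = additive_scores(q, K, W_q, W_k, w_v)
weights = softmax(scores)  # nonnegative and sums to 1
mlp_scores = concat_mlp_scores(q, K, W_q, W_k, w_v)
```

The two score vectors agree because multiplying the stacked matrix $[\mathbf{W}_q \; \mathbf{W}_k]$ by the concatenated vector $[\mathbf{q}; \mathbf{k}]$ gives exactly $\mathbf{W}_q \mathbf{q} + \mathbf{W}_k \mathbf{k}$.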

Updated 2026-05-14

Contributors are:

Who are from:

University of California, Berkeley
