Additive Attention Scoring Function

Additive attention is a scoring function designed for situations where queries and keys reside in vector spaces of differing dimensionality, making a direct dot product infeasible. Introduced by Bahdanau et al. (2014), it projects both the query $\mathbf{q} \in \mathbb{R}^q$ and the key $\mathbf{k} \in \mathbb{R}^k$ into a shared hidden space of dimension $h$ using separate learned weight matrices $\mathbf{W}_q \in \mathbb{R}^{h \times q}$ and $\mathbf{W}_k \in \mathbb{R}^{h \times k}$. These projections are summed element-wise and passed through a $\tanh$ nonlinearity, after which a learned weight vector $\mathbf{w}_v \in \mathbb{R}^h$ reduces the result to a scalar attention score:

$$a(\mathbf{q}, \mathbf{k}) = \mathbf{w}_v^\top \tanh(\mathbf{W}_q \mathbf{q} + \mathbf{W}_k \mathbf{k}) \in \mathbb{R}$$

The scores of a query against all keys are then fed into a softmax function to produce nonnegative attention weights that sum to one. An equivalent interpretation views additive attention as concatenating the query and key and feeding them through an MLP with a single hidden layer of $h$ units and $\tanh$ activation, with no bias terms. As its name suggests, the query and key projections are combined by addition rather than multiplied together, which can yield minor computational savings.
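The scoring function and the softmax step can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the toy dimensions ($q = 2$, $k = 3$, $h = 2$), the hand-picked weight values, and the helper names `matvec`, `additive_score`, `softmax`, and `mlp_score` are all assumptions made for the example. In practice the weights are learned and the computation is batched in a tensor library.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def additive_score(q, k, W_q, W_k, w_v):
    """Return a(q, k) = w_v^T tanh(W_q q + W_k k) as a float."""
    hq = matvec(W_q, q)  # query projected into the hidden space (length h)
    hk = matvec(W_k, k)  # key projected into the hidden space (length h)
    hidden = [math.tanh(a + b) for a, b in zip(hq, hk)]
    return sum(v * h for v, h in zip(w_v, hidden))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy parameters: q = 2, k = 3, h = 2.
W_q = [[0.5, -0.2], [0.1, 0.3]]              # shape h x q
W_k = [[0.2, 0.0, -0.1], [-0.3, 0.4, 0.1]]   # shape h x k
w_v = [1.0, -0.5]                            # length h

query = [1.0, 2.0]
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]

# One scalar score per key, then normalize across keys.
scores = [additive_score(query, key, W_q, W_k, w_v) for key in keys]
weights = softmax(scores)

# MLP view: stack the two matrices side by side, W_cat = [W_q  W_k],
# and apply the result to the concatenated vector [q; k].
W_cat = [row_q + row_k for row_q, row_k in zip(W_q, W_k)]


def mlp_score(q, k):
    """Same score computed through the concatenation form."""
    hidden = [math.tanh(z) for z in matvec(W_cat, q + k)]
    return sum(v * h for v, h in zip(w_v, hidden))


mlp_scores = [mlp_score(query, key) for key in keys]
```

The last part checks the MLP interpretation numerically: because $\mathbf{W}_q \mathbf{q} + \mathbf{W}_k \mathbf{k} = [\mathbf{W}_q \; \mathbf{W}_k][\mathbf{q}; \mathbf{k}]$, `mlp_scores` should match `scores` up to floating-point rounding.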


Updated 2026-05-14

Tags

Data Science

D2L

Dive into Deep Learning @ D2L