Learn Before
Concept

General Attention function

This is similar to dot attention, but it adds a learnable component: the encoder state is first passed through a Dense layer before the dot product is taken. The attention mechanism itself is therefore subject to backpropagation and gradient descent.

$$\mathrm{score}(h, h'_t) = (h'_t)^{\top} W_a h$$

where $h$ is the encoder state, $h'_t$ is the decoder state at time $t$, and $W_a$ is the learned weight matrix of the Dense layer.
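To make the learned projection concrete, here is a minimal sketch in PyTorch (an assumed framework; the source names none). The class and parameter names (`GeneralAttention`, `hidden_dim`) are illustrative, not from the source.

```python
import torch
import torch.nn as nn

class GeneralAttention(nn.Module):
    """Sketch of general attention: score(h, h'_t) = (h'_t)^T W_a h."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # W_a: the learned Dense layer applied to encoder states.
        self.W_a = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, decoder_state, encoder_states):
        # decoder_state:  (batch, hidden_dim)           -- h'_t
        # encoder_states: (batch, src_len, hidden_dim)  -- one h per source position
        projected = self.W_a(encoder_states)            # W_a h
        # Dot product of h'_t with each projected encoder state:
        scores = torch.bmm(projected, decoder_state.unsqueeze(2)).squeeze(2)
        # Because W_a is a trainable parameter, these scores participate
        # in backpropagation and gradient descent.
        return scores                                   # (batch, src_len)

attn = GeneralAttention(hidden_dim=8)
enc = torch.randn(1, 5, 8)   # 5 encoder states
dec = torch.randn(1, 8)      # one decoder state
print(attn(dec, enc).shape)  # torch.Size([1, 5])
```

The scores would typically be normalized with a softmax to produce the attention weights; with `bias=False` and an identity `W_a`, this reduces to plain dot attention.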


Updated 2020-10-10

Tags

Data Science