Learn Before
Self-attention layers' first approach
A self-attention layer maps an input sequence (x_1, ..., x_n) to an output sequence of the same length (y_1, ..., y_n). When processing each item in the input, the model has access to all of the inputs up to and including the one under consideration, but no access to inputs beyond the current one. In the case of self-attention, the comparisons are to other elements within the same sequence. The simplest form of comparison between elements in a self-attention layer is a dot product:

score(x_i, x_j) = x_i · x_j

The larger the value, the more similar the vectors being compared. To make effective use of these scores, we normalize them with a softmax to create a vector of weights, α_{ij}, that indicates the proportional relevance of each input j to the input element i that is the current focus of attention:

α_{ij} = softmax(score(x_i, x_j))   ∀ j ≤ i

Given the proportional scores in α, we then generate an output y_i by taking the sum of the inputs seen so far, weighted by their respective α values:

y_i = Σ_{j ≤ i} α_{ij} x_j
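As an illustration of the three steps above (dot-product scores, softmax normalization, weighted sum), here is a minimal sketch in plain NumPy, assuming no learned parameters; the function and variable names are ours, not from the source.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of scores."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def simple_self_attention(xs):
    """Map inputs (x_1, ..., x_n) to outputs (y_1, ..., y_n).

    For each position i:
      1. score(x_i, x_j) = x_i · x_j   for all j <= i
      2. alpha_ij = softmax over those scores
      3. y_i = sum over j <= i of alpha_ij * x_j
    """
    ys = []
    for i in range(len(xs)):
        scores = np.array([xs[i] @ xs[j] for j in range(i + 1)])  # dot-product comparisons
        alphas = softmax(scores)                                  # proportional relevance weights
        y_i = sum(a * xs[j] for j, a in enumerate(alphas))        # weighted sum of inputs seen so far
        ys.append(y_i)
    return ys

# Example: a sequence of three 4-dimensional input vectors
xs = [np.array([1.0, 0.0, 1.0, 0.0]),
      np.array([0.0, 1.0, 0.0, 1.0]),
      np.array([1.0, 1.0, 0.0, 0.0])]
for i, y in enumerate(simple_self_attention(xs), start=1):
    print(f"y_{i} =", np.round(y, 3))
```

The j ≤ i restriction, implemented here by only scoring positions up to i, is what the transformer literature realizes as causal masking.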

Related
Self-attention layers' first approach
Transformers in contextual generation and summarization
Huggingface Model Summary
A Survey of Transformers (Lin et al., 2021)
Overview of a Transformer
Model Usage of Transformers
Attention in vanilla Transformers
Transformer Variants (X-formers)
The Pre-training and Fine-tuning Paradigm
Architectural Categories of Pre-trained Transformers
Computational Cost of Self-Attention in Transformers
Quadratic Complexity's Impact on Transformer Inference Speed
Pre-Norm Architecture in Transformers
Critique of the Transformer Architecture's Core Limitation
A research team is building a model to summarize extremely long scientific papers. They are comparing two distinct architectural approaches:
- Approach 1: Processes the input text sequentially, token by token, updating an internal state that is passed from one step to the next.
- Approach 2: Processes all input tokens simultaneously, using a mechanism that directly relates every token to every other token in the input to determine context.
Which of the following statements best analyzes the primary trade-off between these two approaches for this specific task?
Architectural Design Choice for Machine Translation
Enablers of Universal Language Capabilities
Model Depth in Transformers
Generalization of the Language Modeling Concept
Transformer Block Sub-Layers
Standard Optimization Objective for Transformer Language Models
Attention Weight Matrix (α)
Sparse Attention
Self-attention layers' first approach
In a general attention mechanism, the output is calculated as a weighted sum of the Value vectors, where the weights are determined by the interaction between Query and Key vectors. The standard formula is: Attention(Q, K, V) = softmax(QK^T / √d_k) V (see the sketch after this list). Consider a scenario where this formula is mistakenly altered to be: . What is the most significant consequence of this modification?
Dimensional Analysis of the Attention Formula
Applying the Attention Mechanism Roles
Self-Attention Output Formula for a Single Query
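To ground the Query/Key/Value formulation referenced in the question above, here is a minimal sketch of standard scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / √d_k) V. This is a generic NumPy illustration under our own naming, not code from the course materials.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k) query vectors
    K: (n_k, d_k) key vectors
    V: (n_k, d_v) value vectors
    Returns: (n_q, d_v) weighted sums of the value vectors.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k) query-key similarity scores
    weights = softmax(scores, axis=-1)  # each row is a distribution over keys
    return weights @ V                  # weighted sum of value vectors

# Example: 4 tokens with d_k = d_v = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```

The division by √d_k keeps the scores from growing with the key dimension, which prevents the softmax from saturating.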
Learn After
Parameter Matrices for Attention Transformations
Introduce weight matrices in the transformer
Calculating an Output Vector in a Simple Sequence Model
In a simple self-attention mechanism where similarity is measured by dot product and weights are normalized by a softmax function, if a current input vector x_i is perfectly orthogonal to a preceding input vector x_j, then x_j will have zero influence on the final output vector y_i.
You are calculating the output vector y_i for a single input vector x_i in a sequence using a simple self-attention mechanism that only considers preceding elements. Arrange the following computational steps in the correct chronological order.