
Self-attention layers' first approach

A self-attention layer maps an input sequence (x_1, ..., x_n) to an output sequence of the same length (y_1, ..., y_n). When processing each item, the model has access to all of the inputs up to and including the one under consideration, but no access to inputs beyond the current one. In self-attention, the comparisons are made to other elements within the same sequence. The simplest form of comparison between elements is a dot product:

score(x_i, x_j) = x_i · x_j

The larger the value, the more similar the vectors being compared. To make effective use of these scores, we normalize them with a softmax to create a vector of weights, α_{ij}, indicating the proportional relevance of each input j to the input element i that is the current focus of attention:

α_{ij} = exp(score(x_i, x_j)) / Σ_{k=1}^{i} exp(score(x_i, x_k)),  ∀ j ≤ i

Given the proportional weights in α, we then generate an output value y_i by taking the sum of the inputs seen so far, weighted by their respective α values:

y_i = Σ_{j ≤ i} α_{ij} x_j
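The three steps above (dot-product scores, causal softmax, weighted sum) can be sketched directly in NumPy. This is a minimal illustration of the simplest form described here, without the learned query/key/value projections used in full transformer layers; the function name and shapes are our own assumptions.

```python
import numpy as np

def simple_causal_self_attention(X):
    """Simplest causal self-attention over rows of X (shape: n x d).

    For each position i:
      scores  = x_i · x_j for all j <= i
      alpha   = softmax over those scores
      y_i     = sum_{j<=i} alpha_ij * x_j
    """
    n, d = X.shape
    Y = np.zeros_like(X, dtype=float)
    for i in range(n):
        scores = X[: i + 1] @ X[i]               # score(x_i, x_j) = x_i · x_j, j <= i
        weights = np.exp(scores - scores.max())  # numerically stable softmax
        weights /= weights.sum()                 # alpha_i1, ..., alpha_ii (sum to 1)
        Y[i] = weights @ X[: i + 1]              # y_i = sum_j alpha_ij x_j
    return Y
```

Note that y_1 always equals x_1: the first position can only attend to itself, so its single softmax weight is 1.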

Updated 2026-05-02

Tags

Data Science

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
