Self-Attention Output Formula
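Given a sequence of input tokens $\mathbf{x}_1, \ldots, \mathbf{x}_n$, where each token $\mathbf{x}_i \in \mathbb{R}^d$ for $1 \le i \le n$, the self-attention mechanism produces an output sequence of the same length, denoted as $\mathbf{y}_1, \ldots, \mathbf{y}_n$. Each output vector $\mathbf{y}_i$ is computed by treating the token $\mathbf{x}_i$ as the query, and the entire sequence of tokens as both the keys and the values. This is mathematically defined as:
$$\mathbf{y}_i = f(\mathbf{x}_i, (\mathbf{x}_1, \mathbf{x}_1), \ldots, (\mathbf{x}_n, \mathbf{x}_n)) \in \mathbb{R}^d$$
where $f$ represents the general attention pooling function.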
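The definition can be made concrete with a small sketch. The description leaves the attention pooling function general, so the code below assumes one common choice, scaled dot-product attention with a softmax over the scores, and omits learned projection matrices. The function names `softmax`, `attention_pool`, and `self_attention` are illustrative and do not come from the source.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(query, pairs):
    """Assumed form of the pooling function: scaled dot-product attention.

    query: list of d floats.
    pairs: list of (key, value) tuples, each a list of d floats.
    Returns the attention-weighted sum of the values.
    """
    d = len(query)
    # One score per key: dot(query, key) / sqrt(d).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key, _ in pairs
    ]
    weights = softmax(scores)
    value_dim = len(pairs[0][1])
    # Weighted sum of the value vectors, one coordinate at a time.
    return [
        sum(w * value[j] for w, (_, value) in zip(weights, pairs))
        for j in range(value_dim)
    ]


def self_attention(xs):
    """Each token acts as the query; every token is both a key and a value."""
    pairs = [(x, x) for x in xs]
    return [attention_pool(x, pairs) for x in xs]


xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # n = 3 tokens, d = 2
ys = self_attention(xs)
print(len(ys), len(ys[0]))  # same sequence length and dimension as the input
```

Because the softmax weights are non-negative and sum to one, each output in this sketch is a convex combination of the input tokens, which is why the output sequence keeps the input's length and dimension.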
Tags
D2L
Dive into Deep Learning @ D2L
Related
Attention Weight Matrix (α)
Sparse Attention
Self-attention layers' first approach
In a general attention mechanism, the output is calculated as a weighted sum of the Value vectors, where the weights are determined by the interaction between Query and Key vectors. The standard formula is: . Consider a scenario where this formula is mistakenly altered to be: . What is the most significant consequence of this modification?
Dimensional Analysis of the Attention Formula
Applying the Attention Mechanism Roles
Self-Attention Output Formula for a Single Query