Self-Attention Output Formula for a Single Query
In the Query-Key-Value (QKV) attention mechanism, the output for an individual query vector is computed as a weighted sum of all value vectors in the sequence. For a sequence of length $m$, this operation is defined as

$$\mathbf{o}_i = \sum_{j=1}^{m} \alpha_{i,j}\, \mathbf{v}_j$$

Here, $\alpha_{i,j}$ is the normalized attention weight that quantifies the relationship between the query at position $i$ and the key at position $j$, while $\mathbf{v}_j$ represents the value vector at position $j$.
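A minimal sketch of this computation in NumPy, assuming (as is conventional, though not stated above) that the raw scores come from query-key dot products and are normalized with a softmax; the function name and shapes are illustrative, not taken from the source:

```python
import numpy as np

def single_query_attention(q, K, V):
    """q: (d_k,) query; K: (m, d_k) keys; V: (m, d_v) values."""
    beta = K @ q                        # raw scores, one per position j (dot-product scoring is an assumption)
    beta = beta - beta.max()            # shift for numerical stability; softmax is invariant to this
    alpha = np.exp(beta) / np.exp(beta).sum()   # normalized attention weights alpha_{i,j}
    return alpha @ V                    # o_i = sum_j alpha_{i,j} * v_j

# Example with m = 3 positions and d_k = d_v = 4
rng = np.random.default_rng(0)
q = rng.standard_normal(4)
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
print(single_query_attention(q, K, V))  # a (4,)-vector: the weighted sum of the value rows
```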
Tags
Foundations of Large Language Models
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Attention Weight Matrix (α)
Sparse Attention
Self-attention layers' first approach
In a general attention mechanism, the output is calculated as a weighted sum of the Value vectors, where the weights are determined by the interaction between Query and Key vectors. The standard formula is $\mathbf{o} = \sum_{j} \alpha_{j}\, \mathbf{v}_j$. Consider a scenario where this formula is mistakenly altered. What is the most significant consequence of this modification?
Dimensional Analysis of the Attention Formula
Applying the Attention Mechanism Roles
Self-Attention Output Formula for a Single Query
In a self-attention mechanism, the raw attention scores (β) for a single query vector with respect to three key vectors are calculated as [2.0, 1.0, 0.5]. To convert these scores into a probability distribution, a normalization function is applied. What is the resulting normalized attention weight (α) corresponding to the first key vector (score of 2.0)?
In a self-attention mechanism, the raw, unnormalized attention scores for a specific query are [1.5, 0.5, -1.0]. If a constant value of 10 is added to each of these scores, resulting in a new set of scores [11.5, 10.5, 9.0], how will the final normalized attention weights (the probability distribution) calculated from the new scores compare to the weights calculated from the original scores? (A numeric check of this and the previous question appears after this list.)
Calculating and Interpreting Attention Weights
Self-Attention Output Formula for a Single Query
Computing Attention Weights in Sequence Parallelism
Distributed Attention Weight Formula
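The two numeric questions above can be checked directly. A small sketch, assuming the normalization function is the standard softmax (as the questions imply): it prints the normalized weights for the scores [2.0, 1.0, 0.5] and confirms that adding a constant to every score leaves the weights unchanged.

```python
import numpy as np

def softmax(beta):
    beta = np.asarray(beta, dtype=float)
    e = np.exp(beta - beta.max())        # subtracting the max does not change the result (see check below)
    return e / e.sum()

weights = softmax([2.0, 1.0, 0.5])       # first question: weight for the score 2.0
original = softmax([1.5, 0.5, -1.0])     # second question: original scores
shifted = softmax([11.5, 10.5, 9.0])     # the same scores with +10 added to each

print(weights)                            # approx [0.63, 0.23, 0.14]
print(np.allclose(original, shifted))     # True: softmax is invariant to adding a constant
```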