Implicit Relative Position Modeling in Self-Attention with RoPE
When Rotary Positional Embeddings (RoPE) are applied to query and key vectors, the self-attention mechanism inherently captures relative positional context. Specifically, if the RoPE-encoded vectors Ro(x, tθ) and Ro(y, sθ) are treated as the query and key respectively, the self-attention operation implicitly models relative positions: their dot product satisfies Ro(x, tθ) · Ro(y, sθ) = x · Ro(y, (s − t)θ), so the attention score depends on the absolute positions t and s only through the relative displacement t − s.
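As a numeric check on this property, here is a minimal NumPy sketch (illustrative only; the pairing of dimensions and the base value 10000 follow the common RoPE convention and are assumptions, not taken from these notes). It rotates a toy query and key to positions 5 and 8, then to 15 and 18, and confirms the two scores match because the offset t − s is the same in both cases.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply a RoPE-style rotation to vector x for absolute position `pos`.

    Consecutive dimension pairs (x[2i], x[2i+1]) are rotated by the angle
    pos * theta_i, where theta_i = base**(-2i/d).
    """
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE pairs up dimensions, so d must be even"
    theta = base ** (-np.arange(0, d, 2) / d)      # one angle per 2-D pair
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin         # 2-D rotation of each pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)      # toy query / key vectors

# Same relative displacement t - s = -3 in both cases.
score_a = rope_rotate(q, 5) @ rope_rotate(k, 8)
score_b = rope_rotate(q, 15) @ rope_rotate(k, 18)
print(np.isclose(score_a, score_b))                # True: score depends only on t - s
```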

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Derivation of the Dot Product for RoPE-Encoded Vectors
Implicit Relative Position Modeling in Self-Attention with RoPE
A language model uses a positional encoding scheme with a specific mathematical property: the dot product between the encoded representations of any two tokens is a function solely of the difference between their positions in the sequence. Which of the following statements most accurately analyzes the primary advantage of this property for processing language?
In a system that encodes token positions by rotating their vector representations, the dot product between the encoded vector for a token at position t and another at position s is found to be dependent only on their relative displacement (t - s). Based on this property, the dot product calculated for a pair of tokens at positions 5 and 8 would be identical to the dot product for the same pair of tokens if they were located at positions 15 and 18.
Diagnosing a Positional Encoding Flaw
Query (Attention)
Key (Attention)
Value (Attention)
State Function from Previous Outputs
Value Weight Matrix Formula
Set of Sequential Key-Value Pairs
Query Vector
Key Vector
Value Vector
Implicit Relative Position Modeling in Self-Attention with RoPE
Value Weight Matrix Definition
Imagine a system translating the sentence 'The quick brown fox jumps'. When the system is generating the output word corresponding to 'jumps', it needs to determine which words in the input sentence are most relevant. To do this, a vector representing the current translation context (i.e., 'what information do I need to produce the next word?') is compared against a set of searchable 'label' vectors, one for each word in the input sentence. This comparison generates a relevance score for each input word. Finally, a new vector is created by taking a weighted average of the 'content' vectors of the input words, using the relevance scores as weights. How do the three main vector types in this process correspond to their roles?
In a system designed to answer questions based on a provided document, the model first creates a representation of the user's question. It then compares this representation against a set of searchable representations, one for each sentence in the document, to determine relevance scores. Finally, it constructs an answer by creating a weighted combination of the informational content from each sentence, using the relevance scores as weights. Which option correctly assigns the roles of Query, Key, and Value vectors in this scenario?
Context Window of Key Vectors Notation
Key-Value Cache
In a computational mechanism designed to selectively focus on different parts of an input sequence, information is represented by three distinct types of vectors that interact to produce a context-aware output. Match each vector type to its specific role in this process.
Masked QKV Attention Formula
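To tie the Query/Key/Value scenarios and the masked-attention card above together, here is a minimal NumPy sketch (illustrative only; the toy shapes and function name are assumptions, not taken from these notes). Each query is scored against every key, scores at future positions are masked out, and the softmax-weighted average of the value vectors forms the output.

```python
import numpy as np

def masked_attention(Q, K, V):
    """Causal scaled dot-product attention.

    Q, K, V: arrays of shape (seq_len, d). Each query is compared against
    every key to get relevance scores, future positions are masked to -inf,
    softmax turns scores into weights, and the output mixes the value vectors.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # relevance of each key to each query
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)             # hide positions after the query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ V                                   # weighted average of value vectors

# Toy example: 4 tokens, 6-dimensional vectors.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 6)) for _ in range(3))
print(masked_attention(Q, K, V).shape)                   # (4, 6)
```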
Learn After
An attention mechanism incorporates positional information by applying a unique rotation to each query and key vector based on its absolute position in a sequence. The attention score between a query from position 't' and a key from position 's' is then computed. A key property of this rotation is that the dot product between the rotated query and key vectors is a function of the original vectors and the difference in their positions (t-s). Based on this information, what can be concluded about the attention scores produced by this mechanism?
Analysis of a Positional Encoding Method
Evaluating a Model's Performance Discrepancy