Learn Before
Period Matching Equation for RoPE Base Scaling
The period matching equation is a fundamental condition for adapting Rotary Positional Embeddings (RoPE) to sequence lengths different from the one the model was trained on. The equation is expressed as: Ro'(x, θ) = Ro(x, θ'), where x is the input vector. It ensures that the scaled RoPE transformation (Ro') applied with the original positional angles (θ) yields the same result as the original RoPE transformation (Ro) applied with scaled positional angles (θ'). Satisfying this condition preserves the relative positional encoding when extending the context length.
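The condition can be checked numerically with a minimal sketch. It rests on assumptions that this card does not state: the usual RoPE frequency form theta_k = base^(-2k/d) for k = 0, ..., d/2 - 1, and the commonly cited base-scaling rule base' = base * scale^(d/(d-2)). The function names, the base of 10000, the dimension of 64, and the scale of 4 are hypothetical choices for illustration only.

```python
import math


def rope_frequencies(base, dim):
    """Assumed standard RoPE form: theta_k = base ** (-2k / dim), k = 0 .. dim/2 - 1."""
    return [base ** (-2.0 * k / dim) for k in range(dim // 2)]


def scaled_base(base, dim, scale):
    """Assumed base-scaling rule: base' = base * scale ** (dim / (dim - 2))."""
    return base * scale ** (dim / (dim - 2))


def rotate_pair(pair, angle):
    """Rotate one 2-D feature pair by the given angle, as RoPE does per dimension pair."""
    x0, x1 = pair
    c, s = math.cos(angle), math.sin(angle)
    return (x0 * c - x1 * s, x0 * s + x1 * c)


# Hypothetical settings, chosen only for illustration.
base, dim, scale = 10000.0, 64, 4.0

theta = rope_frequencies(base, dim)                                   # original angles
theta_scaled = rope_frequencies(scaled_base(base, dim, scale), dim)   # scaled angles

# The lowest-frequency angle shrinks by the scale factor, so its period grows by it.
lowest_ratio = theta[-1] / theta_scaled[-1]

# For that dimension, rotating with the scaled angle at position i gives the same
# result as rotating with the original angle at the interpolated position i / scale.
position = 1000
pair = (1.0, 0.0)
via_scaled_base = rotate_pair(pair, position * theta_scaled[-1])
via_interpolation = rotate_pair(pair, (position / scale) * theta[-1])

print(lowest_ratio)
print(via_scaled_base, via_interpolation)
```

Under these assumptions, only the lowest-frequency dimension matches linear interpolation exactly. The highest-frequency angle (k = 0) is left unchanged by the base scaling, and the dimensions in between are stretched by intermediate amounts.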

Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Period Matching Equation for RoPE Base Scaling
Equation for Matching Periods in RoPE Base Scaling
A team of engineers is adapting a pre-trained language model to handle much longer text sequences. They decide to use a method that involves scaling the base value used in the model's rotational position embeddings. To select the appropriate scaling factor, they must adhere to a specific guiding principle. Which of the following best describes this principle?
Selecting a RoPE Base Scaling Factor
When adapting a language model for longer sequences by scaling its rotational position embedding base, the guiding constraint is to match the period of the lowest frequency dimension in the scaled model to the period of a model using linear interpolation.
Learn After
Origin of NTK-Aware Scaled RoPE
Formula for Scaled RoPE Frequency Parameters (θ')
An engineer extends the context window of a language model that uses rotary positional embeddings. After modification, they find the model struggles with tasks requiring an understanding of long-range dependencies, as if the relative positioning of distant tokens is lost. Which of the following statements best analyzes the fundamental reason for this failure?
Two engineers are modifying a language model's Rotary Positional Embeddings (RoPE) to handle longer text sequences.
- Engineer A proposes modifying the core RoPE transformation function itself (creating a new function, Ro') while keeping the original positional angles (θ) the same.
- Engineer B proposes keeping the original RoPE transformation function (Ro) unchanged but applying it to a new, scaled set of positional angles (θ').
To ensure that the relative positional information is preserved correctly during this context extension, a key condition must be met: the outcome of the new system must be equivalent to the outcome of the original system applied to scaled positions. Based on this principle, which engineer's approach is more theoretically sound, and why?
Interpreting the RoPE Scaling Condition