Learn Before
  • Kerple

Formula for Relative Position Scaled by Sinusoidal Wavelength

This formula calculates a value based on the relative distance $(i - j)$ between two sequence positions. This distance is then scaled by a denominator, $10000^{2k/d}$, which serves as a wavelength term and is a core component of the sinusoidal positional encoding scheme from the original Transformer model. The full formula is $(i - j)/10000^{2k/d}$. In this expression, $k$ typically represents the dimension index within the embedding, and $d$ is the total dimensionality of the model's embeddings. The resulting value is commonly used as an input to sine and cosine functions to generate a final positional encoding vector or bias.
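The scaling can be sketched in a few lines of Python. The function names here are illustrative, not from any particular library; the second function shows the standard Transformer usage, where the scaled position is fed through sine (even dimensions) and cosine (odd dimensions):

```python
import math

def scaled_relative_position(i, j, k, d):
    # Scale the relative distance (i - j) by the wavelength
    # term 10000^(2k/d) for dimension index k in a d-dim model.
    return (i - j) / (10000 ** (2 * k / d))

def sinusoidal_encoding(pos, d):
    # Original Transformer positional encoding: each pair of
    # dimensions (2k, 2k+1) shares one wavelength; sine fills
    # the even slot and cosine the odd slot.
    vec = []
    for k in range(d // 2):
        angle = pos / (10000 ** (2 * k / d))
        vec.append(math.sin(angle))  # dimension 2k
        vec.append(math.cos(angle))  # dimension 2k + 1
    return vec
```

Note that at $k = 0$ the denominator is $10000^0 = 1$, so the scaled value is just the raw distance $i - j$; larger $k$ gives exponentially longer wavelengths and slower-varying signals.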


Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Related
  • Kerple Positional Bias Formula

  • Kerple Logarithmic Bias Formula

  • Sandwich Method (Chi et al., 2023)

  • Formula for Relative Position Scaled by Sinusoidal Wavelength

  • A transformer model incorporates a positional bias mechanism where a penalty is applied to the attention score between a query and a key. This penalty grows larger as the distance between the query's position and the key's position in the sequence increases. Given the sentence 'The quick brown fox jumps over the lazy dog', which of the following query-key pairs would receive the smallest penalty from this mechanism?

  • Comparing Positional Bias Functions

  • A self-attention mechanism is modified to include a bias term that systematically penalizes attention scores between pairs of tokens. The magnitude of this penalty increases as the distance between the tokens' positions in the sequence grows. For which of the following tasks would this modification be most likely to hinder the model's performance?

Learn After
  • In the formula (i - j) / 10000^(2k/d), used for calculating a scaled relative position, k represents a specific dimension index within a d-dimensional embedding. Analyze the relationship between the dimension index k and the denominator 10000^(2k/d). What is the effect of this relationship on the resulting positional signal?

  • Debugging a Positional Encoding Implementation

  • Analysis of Wavelength Variation