Learn Before
FIRE
FIRE (Functional Interpolation for Relative position Encoding), introduced by Li et al. (2024b), is a method for computing positional bias in attention mechanisms. Instead of looking up a learned bias per offset, it models the bias as a learnable continuous function of the relative distance between token positions.
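Concretely, FIRE computes the bias as b(i, j) = f_θ(ψ(i − j) / ψ(max(L, i))) with ψ(x) = log(cx + 1), where f_θ is a small MLP, c a scale, and L a threshold (learned in the paper). Below is a minimal NumPy sketch under those assumptions; the MLP weights are random stand-ins for learned parameters, and c and L are fixed for illustration:

```python
import numpy as np

def fire_bias(seq_len, mlp, c=1.0, L=2048):
    """FIRE-style positional bias (sketch):
    b(i, j) = f_theta( psi(i - j) / psi(max(L, i)) ),  psi(x) = log(c*x + 1),
    computed for causal pairs j <= i; future positions are masked with -inf."""
    psi = lambda x: np.log(c * x + 1.0)
    bias = np.full((seq_len, seq_len), -np.inf)
    for i in range(seq_len):
        for j in range(i + 1):
            bias[i, j] = mlp(psi(i - j) / psi(max(L, i)))
    return bias

# Toy one-hidden-layer MLP f_theta with fixed random weights (illustration only).
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(16,)), rng.normal(size=(16,))
w2 = rng.normal(size=(16,))
mlp = lambda x: (w2 @ np.tanh(w1 * x + b1)).item()

B = fire_bias(8, mlp)
```

Because the bias is a continuous function of the (normalized) distance rather than a table indexed by offset, it extrapolates naturally to sequence lengths not seen in training; for short sequences (i < L) the normalizer ψ(L) is constant, so the bias depends only on i − j.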
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Related
Interpretation of Positional Bias as a Distance Penalty
T5 Bias for Relative Positional Embedding
Shared Learnable Bias per Offset
Heuristic-Based Relative Positional Biases
Comparison of Learned vs. Heuristic-Based Relative Positional Biases
Kerple
FIRE
Relative Position Offset Calculation
A self-attention model incorporates positional awareness by adding a bias term directly to the query-key dot product for each pair of positions (i, j). This bias term's value depends on the relative distance between i and j. What is the primary implication of this approach compared to the alternative of adding positional vectors to the input token embeddings?
Incorporating Positional Bias into Attention Scores
In a self-attention mechanism, the score computed between a query at position
i and a key at position j is modified by directly adding a bias term whose value depends only on the positions i and j. What is the primary function of this bias term within the attention calculation?
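The mechanism the two questions above describe can be sketched in a few lines of NumPy: the bias matrix is added to the scaled query-key scores before the softmax, so it rescales attention weights rather than altering the token embeddings themselves. The per-offset bias values here are random stand-ins for learned parameters (a T5-style shared bias per offset, chosen only to make the example concrete):

```python
import numpy as np

def attention_with_bias(Q, K, V, bias):
    """Self-attention with an additive positional bias:
    score[i, j] = (q_i . k_j) / sqrt(d) + bias[i, j], then row-wise softmax."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + bias
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))

# Bias depends only on the offset i - j: one shared value per possible offset.
offsets = rng.normal(size=2 * n - 1)
bias = np.array([[offsets[i - j + n - 1] for j in range(n)] for i in range(n)])

out = attention_with_bias(Q, K, V, bias)
```

Because `bias[i, j]` is indexed by i − j alone, the positional adjustment between positions 3 and 7 is identical to the one between positions 23 and 27, which is exactly the translation-invariance property the questions probe.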
Learn After
FIRE Positional Bias Formula
A self-attention mechanism is designed so that the positional influence on the attention score between any two tokens depends only on their relative distance, not their absolute locations. For instance, the positional adjustment between the 3rd and 7th tokens is identical to the adjustment between the 23rd and 27th tokens. Which of the following techniques directly implements this principle?
Analyzing the Functional Approach to Positional Bias
An LLM architect is designing a self-attention mechanism where the positional influence between any two tokens is calculated directly as a bias in the attention score. The core design principle is that this bias must be determined by a specific, continuous mathematical function that takes only the relative distance between the tokens as its input. Which of the following implementation strategies directly realizes this design principle?