Learn Before
Comparison of Position Offsets in Causal vs. Bidirectional Attention
The range of the relative position offset, i - j, depends on the type of attention mechanism in use. In causal attention, the standard for language modeling, a query at position i can attend only to its left context (positions j where j ≤ i), so the offset i - j is always non-negative. In contrast, general or bidirectional self-attention lets a token attend to the entire sequence, including positions where j > i, thereby permitting negative offsets.
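A minimal sketch of this contrast (pure Python; the function name and setup are illustrative, not from the source) enumerating the offsets i - j each pattern produces:

```python
def offsets(seq_len, causal):
    """Return the relative offsets i - j that each query position i
    may compute against its allowed key positions j."""
    result = []
    for i in range(seq_len):
        # Causal attention restricts keys to the left context (j <= i);
        # bidirectional attention may look at every position.
        keys = range(i + 1) if causal else range(seq_len)
        result.extend(i - j for j in keys)
    return result

causal = offsets(4, causal=True)
bidir = offsets(4, causal=False)
print(min(causal), max(causal))  # 0 3  -> offsets are never negative
print(min(bidir), max(bidir))    # -3 3 -> negative offsets appear (j > i)
```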
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Calculating a Relative Position Bias Bucket
The T5 relative position bias bucketing formula is a piecewise function, treating small and large relative position offsets differently. For small offsets, it uses a direct one-to-one mapping to a bucket. For larger offsets, it transitions to a logarithmic mapping. What is the primary rationale behind this dual-strategy design?
A key characteristic of the T5 relative position bias bucketing formula is that it maintains a consistent level of positional precision regardless of the distance between tokens. For example, the distinction it makes between relative positions 10 and 20 is just as fine-grained as the distinction it makes between relative positions 500 and 510.
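A scalar sketch of the dual-strategy bucketing described above (causal case, so offsets are non-negative). The function name and the defaults num_buckets=32 and max_distance=128 follow common T5 implementations but should be treated as assumptions here. The demo also shows why the uniform-precision claim above does not hold: nearby offsets keep distinct buckets while distant ones merge.

```python
import math

def relative_position_bucket(offset, num_buckets=32, max_distance=128):
    """Map a non-negative relative offset i - j to a bucket index."""
    max_exact = num_buckets // 2
    if offset < max_exact:
        # Small offsets: direct one-to-one mapping, full precision.
        return offset
    # Large offsets: logarithmic mapping, so ever-wider ranges of
    # distances share a single bucket.
    bucket = max_exact + int(
        math.log(offset / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return min(bucket, num_buckets - 1)

print(relative_position_bucket(10), relative_position_bucket(20))    # 10 17 -> distinct buckets
print(relative_position_bucket(500), relative_position_bucket(510))  # 31 31 -> same bucket
```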
Visualization of T5 Bias Bucketing
Learn After
An engineer is inspecting a self-attention layer and observes that for a given query token, the set of calculated relative position offsets (query_position - key_position) includes both positive and negative values. What can be concluded about the nature of this attention mechanism?
In a self-attention mechanism designed for a machine translation encoder, which processes an entire source sentence at once, the relative position offset between a query at position i and a key at position j (calculated as i - j) must always be greater than or equal to zero.
Choosing an Attention Mechanism for a Language Task