Learn Before
Comparison of Position Offsets in Causal vs. Bidirectional Attention
The range of the relative position offset, i - j, depends on the type of attention mechanism in use. In causal attention, the standard for language modeling, a query at position i can attend only to its left context (positions j where j ≤ i), so the offset i - j is always non-negative. In contrast, general or bidirectional self-attention lets a token attend to the entire sequence, including positions where j > i, thereby permitting negative offsets.
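A minimal sketch of this contrast (pure Python; the function name and setup are illustrative, not from the source) enumerating the offsets i - j each pattern produces:

```python
def offsets(seq_len, causal):
    """Return the relative offsets i - j that each query position i
    may compute against its allowed key positions j."""
    result = []
    for i in range(seq_len):
        # Causal attention restricts keys to the left context (j <= i);
        # bidirectional attention may look at every position.
        keys = range(i + 1) if causal else range(seq_len)
        result.extend(i - j for j in keys)
    return result

causal = offsets(4, causal=True)
bidir = offsets(4, causal=False)
print(min(causal), max(causal))  # 0 3  -> offsets are never negative
print(min(bidir), max(bidir))    # -3 3 -> negative offsets appear (j > i)
```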
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Calculating a Relative Position Bias Bucket
The T5 relative position bias bucketing formula is a piecewise function, treating small and large relative position offsets differently. For small offsets, it uses a direct one-to-one mapping to a bucket. For larger offsets, it transitions to a logarithmic mapping. What is the primary rationale behind this dual-strategy design?
A key characteristic of the T5 relative position bias bucketing formula is that it maintains a consistent level of positional precision regardless of the distance between tokens. For example, the distinction it makes between relative positions 10 and 20 is just as fine-grained as the distinction it makes between relative positions 500 and 510.
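A scalar sketch of the dual-strategy bucketing described above (causal case, so offsets are non-negative). The function name and the defaults num_buckets=32 and max_distance=128 follow common T5 implementations but should be treated as assumptions here. The demo also shows why the uniform-precision claim above does not hold: nearby offsets keep distinct buckets while distant ones merge.

```python
import math

def relative_position_bucket(offset, num_buckets=32, max_distance=128):
    """Map a non-negative relative offset i - j to a bucket index."""
    max_exact = num_buckets // 2
    if offset < max_exact:
        # Small offsets: direct one-to-one mapping, full precision.
        return offset
    # Large offsets: logarithmic mapping, so ever-wider ranges of
    # distances share a single bucket.
    bucket = max_exact + int(
        math.log(offset / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return min(bucket, num_buckets - 1)

print(relative_position_bucket(10), relative_position_bucket(20))    # 10 17 -> distinct buckets
print(relative_position_bucket(500), relative_position_bucket(510))  # 31 31 -> same bucket
```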
Visualization of T5 Bias Bucketing
Learn After
An engineer is inspecting a self-attention layer and observes that for a given query token, the set of calculated relative position offsets (query_position - key_position) includes both positive and negative values. What can be concluded about the nature of this attention mechanism?
In a self-attention mechanism designed for a machine translation encoder, which processes an entire source sentence at once, the relative position offset between a query at position i and a key at position j (calculated as i - j) must always be greater than or equal to zero.
Choosing an Attention Mechanism for a Language Task