Learn Before
  • Synthesis of T5 Bias Bucketing Rules

Formula for Applying T5 Relative Position Bias

The T5 relative position bias is incorporated directly into the attention score calculation. A learnable scalar bias, denoted $u_{b(i-j)}$, is added to the query-key dot product. This sum is then scaled by the inverse square root of the head dimension $d$ before the Softmax function is applied. The specific bias value is determined by the bucket $b(i-j)$ that corresponds to the relative offset between the query at position $i$ and the key at position $j$. The complete formula for the attention score $\alpha(i, j)$ is:

$$\alpha(i, j) = \mathrm{Softmax}\!\left( \frac{q_i^T k_j + u_{b(i-j)}}{\sqrt{d}} + \mathrm{Mask}(i, j) \right)$$

where $\mathrm{Mask}(i, j)$ is the attention mask.
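The formula can be sketched directly in NumPy. Note that `bucket_fn` and `bias_table` below are placeholders, not part of the formula: any mapping from relative offsets $i-j$ to bucket indices, and any learned vector of scalar biases indexed by bucket, would fit this sketch.

```python
import numpy as np

def t5_biased_attention(Q, K, bias_table, bucket_fn, mask=None):
    """Attention weights with a T5-style relative position bias.

    The scalar bias u_{b(i-j)} is added to the raw dot product
    q_i^T k_j, the sum is scaled by 1/sqrt(d), the mask is added,
    and Softmax is taken over the key axis -- per the formula above.
    """
    n, d = Q.shape
    scores = Q @ K.T                                          # q_i^T k_j for all i, j
    offsets = np.arange(n)[:, None] - np.arange(n)[None, :]   # relative offsets i - j
    buckets = bucket_fn(offsets)                              # b(i - j)
    scores = (scores + bias_table[buckets]) / np.sqrt(d)      # bias added BEFORE scaling
    if mask is not None:
        scores = scores + mask                                # e.g. 0 / -inf additive mask
    scores -= scores.max(axis=-1, keepdims=True)              # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)
```

For a causal decoder, `mask` would be an upper-triangular matrix of `-inf` above the diagonal and `0` elsewhere; for full self-attention it can be omitted.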


Tags
  • Ch.2 Generative Models - Foundations of Large Language Models
  • Foundations of Large Language Models
  • Foundations of Large Language Models Course
  • Computing Sciences

Related
  • Unified Formula for T5 Bias Bucketing

  • Example of T5 Bias Bucketing

  • Visual Representation of T5 Bias Application ($n_b = 3$, $\mathrm{dist}_{\max} = 5$)

  • A model designer is implementing a mechanism to account for the relative distance between tokens in a sequence. The proposed strategy uses a unique, learnable value for each of the first few relative distances (e.g., 1, 2, 3...), but then groups larger distances into a smaller set of shared values, with the size of these groups increasing as the distance grows. What is the primary trade-off this combined approach is designed to optimize?

  • Analysis of a Hybrid Positional Bucketing System

  • Formula for Applying T5 Relative Position Bias

  • Generalization Advantage of T5 Positional Bias

  • A model uses a hybrid strategy to handle relative positional distances between tokens, assigning each distance to one of a limited number of 'buckets'. The rules are:

    1. For small distances (e.g., 0-15), each distance is assigned to its own unique bucket.
    2. For medium distances, the ranges of distances assigned to a single bucket grow progressively larger as the distance increases.
    3. For very large distances (e.g., beyond 512), all are assigned to a single, final bucket.

    Based on this system, which of the following distances is most likely to be assigned to the same bucket as the distance 40?
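The three bucketing rules above can be sketched as a single function. This is a minimal sketch of a T5-style scheme, assuming logarithmically spaced buckets in the medium range; the thresholds `num_exact=16` and `max_distance=512` mirror the example numbers in the question, and `num_buckets=32` is an illustrative choice, not a value given in the text.

```python
import math

def relative_bucket(distance, num_exact=16, max_distance=512, num_buckets=32):
    """Map a nonnegative relative distance to a bucket index.

    Rule 1: distances below num_exact each get their own bucket.
    Rule 2: medium distances share log-spaced ranges that widen
            as the distance grows.
    Rule 3: distances at or beyond max_distance all share the
            final bucket.
    """
    if distance < num_exact:
        return distance
    if distance >= max_distance:
        return num_buckets - 1
    # Log-spaced buckets between num_exact and max_distance:
    # equal steps in log(distance) map to equal steps in bucket index,
    # so each successive bucket covers a wider range of distances.
    log_ratio = math.log(distance / num_exact) / math.log(max_distance / num_exact)
    return num_exact + int(log_ratio * (num_buckets - num_exact))
```

Under these assumed parameters, nearby medium distances such as 40 and 45 fall in the same bucket, while a much larger distance like 100 lands in a later one, illustrating why the answer hinges on how wide the bucket containing 40 is.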

Learn After
  • In a standard attention mechanism, an attention score is computed from a query vector (q) and a key vector (k). Consider a modification where a learnable scalar bias is added directly to the query-key dot product before the result is scaled and passed through a Softmax function. The value of this bias is determined solely by the relative distance between the query and key. How does this specific modification influence the attention mechanism's behavior?

  • Calculating T5 Attention Score with Relative Position Bias

  • A researcher implements a modified attention mechanism where the learnable scalar bias, based on relative position, is applied after the query-key dot product is scaled. The formula used is: $\alpha(i, j) = \mathrm{Softmax}\!\left( \frac{q_i^T k_j}{\sqrt{d}} + u_{b(i-j)} + \mathrm{Mask}(i, j) \right)$ What is the most significant consequence of this specific modification compared to the standard approach of adding the bias before scaling?