Google

A basic architecture for a relative positional bias assigns a singular, shared learnable parameter to each distinct query-key distance. Under this framework, the bias value for any query $$\mathbf{q}_i$$ and key $$\mathbf{k}_j$$ relies entirely on their offset, $$i - j$$. Consequently, all pairs sharing this identical offset are mapped to the same variable, $$u_{i-j}$$, resulting in the mathematical relationship: $$\mathrm{PE}(i, j) = u_{i-j}$$.

Shared Learnable Bias per Offset

A major disadvantage of allocating a unique learnable value to every possible sequence offset is that the model becomes rigidly tied to the distances it observed during training. If the architecture processes sequences where the offset $$i - j$$ is greater than the maximum distance encountered in the training phase, it lacks the appropriate learned variables for those extended distances, preventing effective generalization.

Generalization Limit of Offset-Specific Biases

Given the following scenario where a model uses a shared, learnable bias for each relative position offset, determine the correct positional bias value to be used for the specified query-key pair.

Calculating Positional Bias from Offset

In a self-attention mechanism that uses a shared, learnable parameter for each unique relative position offset, which of the following query-key pairs will share the exact same positional bias parameter as the pair with a query at position 8 and a key at position 3?

The T5 bias, introduced by Raffel et al. (2020), is an advanced approach that generalizes the concept of offset-specific biases. To address the generalization problem of assigning a unique parameter to every offset, T5 groups various query-key offsets into a limited number of 'buckets.' Each bucket is then associated with a single, shared learnable parameter, enabling the model to handle a wide range of relative positions, including those not seen during training.

T5 Bias for Relative Positional Embedding

Consider a self-attention model that implements relative positional encoding by assigning a unique, learnable bias parameter to each possible offset between a query and a key. If this model is trained exclusively on sequences with a maximum length of 128 tokens, analyze how many distinct learnable bias parameters are required for this positional encoding scheme. Explain your reasoning.

Learn Before

Related