Multiple Choice

A researcher implements a modified attention mechanism in which the learnable scalar bias, indexed by relative position, is added after the query-key dot product has been scaled:

$$\alpha(i, j) = \text{Softmax}\left( \frac{q_i^T k_j}{\sqrt{d}} + u_{b(i-j)} + \text{Mask}(i, j) \right)$$

What is the most significant consequence of this specific modification compared with the standard approach of adding the bias before scaling?
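To make the contrast in the question concrete, here is a minimal NumPy sketch (toy dimensions and random values are assumptions for illustration) that computes the attention weights both ways: with the relative-position bias $u_{b(i-j)}$ added after the $\sqrt{d}$ scaling, as in the formula above, and added before it, so the bias is also divided by $\sqrt{d}$.

```python
import numpy as np

# Toy setup (assumed for illustration): sequence length n, head dim d.
rng = np.random.default_rng(0)
n, d = 4, 8
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
u = rng.normal(size=(2 * n - 1,))  # one learnable scalar per offset i - j

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Causal mask: 0 where attention is allowed, -inf where j > i.
mask = np.where(np.tril(np.ones((n, n), dtype=bool)), 0.0, -np.inf)

# Look up u_{b(i-j)} for every query/key pair (i, j).
offsets = np.subtract.outer(np.arange(n), np.arange(n))  # matrix of i - j
bias = u[offsets + n - 1]

# Modified variant from the question: bias added AFTER scaling,
# so it enters the logits at full magnitude.
alpha_after = softmax(Q @ K.T / np.sqrt(d) + bias + mask)

# Standard variant for comparison: bias added BEFORE scaling,
# so it is attenuated by 1/sqrt(d) along with the dot products.
alpha_before = softmax((Q @ K.T + bias) / np.sqrt(d) + mask)

# In general the two give different attention distributions.
print(np.allclose(alpha_after, alpha_before))
```

Because the bias escapes the $1/\sqrt{d}$ attenuation in the modified variant, the same learned values of $u$ exert a stronger influence on the attention distribution than they would in the standard placement.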

Updated 2025-10-08

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science