1Cademy - In a text-processing model, a bias term is added to the attention scores between a query word and a key word before the final attention weights are computed. This bias is calculated as `-β * d`, where `d` is the distance (number of words) between the query and the key, and `β` is a fixed positive number. What is the primary effect of this biasing scheme on the models behavior?

Learn Before

Example of Linear Relative Position Bias Values in Causal Attention

Multiple Choice

In a text-processing model, a bias term is added to the attention scores between a 'query' word and a 'key' word before the final attention weights are computed. This bias is calculated as -β * d, where d is the distance (number of words) between the query and the key, and β is a fixed positive number. What is the primary effect of this biasing scheme on the model's behavior?

Updated 2025-10-02

Contributors are:

Who are from:

Learn Before

Related