Learn Before
A developer is implementing an attention layer for a model that incorporates positional information by adding a learnable scalar bias based on the relative distance between tokens. Given a query vector q_i for a token at position i, a key vector k_j for a token at position j, a key dimension d_k, and the specific learnable bias u_{b(i-j)} for their relative position, which of the following expressions correctly computes the unnormalized attention score (the value passed into the softmax function) for this architectural design?
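To make the design concrete, here is a minimal sketch of an unnormalized attention score with an additive learnable relative-position bias. The bucketing function `b(i - j)` is assumed here to be simple clipping to a fixed window (real implementations such as T5's use log-spaced buckets), and the optional `scale` argument marks the design choice the question probes: scaled dot-product variants divide by sqrt(d_k), while T5-style attention omits that scaling.

```python
import numpy as np

def relative_bias_score(q_i, k_j, i, j, bias_table, max_dist=8, scale=None):
    """Unnormalized attention score with a learnable relative-position bias.

    bias_table: 1-D array of learnable scalars indexed by a bucketed
        relative distance b(i - j); here b() is plain clipping to
        [-max_dist, max_dist], an illustrative stand-in for T5's buckets.
    scale: pass sqrt(d_k) for a scaled dot-product variant; leave as
        None for T5-style unscaled attention.
    """
    # b(i - j): clip the relative offset and shift it to a valid index.
    bucket = int(np.clip(i - j, -max_dist, max_dist)) + max_dist
    score = float(np.dot(q_i, k_j))          # q_i . k_j
    if scale is not None:
        score /= scale                       # optional 1/sqrt(d_k) scaling
    return score + float(bias_table[bucket])  # add u_{b(i-j)}

# Usage: d_k = 4, token positions i = 5, j = 2.
rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)
bias_table = np.zeros(2 * 8 + 1)  # learnable parameters in practice
s = relative_bias_score(q, k, 5, 2, bias_table, scale=np.sqrt(4))
```

The value `s` is what would be passed into the softmax over all key positions j for a fixed query position i.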
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analysis of T5 Attention Formula Modifications
Evaluating the Design of T5's Unscaled Attention Mechanism