Learn Before
In a language model using the complete ALiBi attention formula for causal text generation, the model needs to prevent a query token at position i from attending to any key token at a future position j (where j > i). How does the Mask(i, j) term within the formula α(i, j) = Softmax((q_iᵀk_j + β⋅(j-i))/√d + Mask(i, j)) achieve this?
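A minimal sketch of the mechanism this question asks about, assuming a small NumPy setup: Mask(i, j) is 0 when j ≤ i and -∞ when j > i, so after the softmax every future key receives exactly zero attention weight. The shapes and the value of beta below are illustrative assumptions, not a reference ALiBi implementation.

# Sketch of causal masking in the card's formula
#   alpha(i, j) = Softmax((q_i^T k_j + beta * (j - i)) / sqrt(d) + Mask(i, j))
import numpy as np

def alibi_causal_attention(Q, K, beta=-0.5):
    """Q, K: (seq_len, d) arrays of query and key vectors (illustrative)."""
    seq_len, d = Q.shape
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    scores = (Q @ K.T + beta * (j - i)) / np.sqrt(d)
    # Mask(i, j) = 0 for j <= i, -inf for j > i: exp(-inf) = 0,
    # so future keys get zero weight after the softmax.
    scores = scores + np.where(j > i, -np.inf, 0.0)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
A = alibi_causal_attention(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
print(np.round(A, 3))   # upper triangle (j > i) is exactly 0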
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A language model computes its pre-normalized attention scores using the formula:
Score = (query_vector ⋅ key_vector + β ⋅ (key_position - query_position)) / sqrt(dimension)
In this model, the scalar hyperparameter β is a fixed negative number. Consider a query token at position i = 10. How does the bias term β ⋅ (key_position - query_position) influence the scores for a key token at position j = 12 compared to a key token at position j = 20, assuming all other components of the score are equal for both keys? (A worked numeric sketch follows at the end of this Related list.)
Calculating a Pre-Softmax Attention Score with Positional Bias
In a language model using the complete ALiBi attention formula for causal text generation, the model needs to prevent a query token at position i from attending to any key token at a future position j (where j > i). How does the Mask(i, j) term within the formula α(i, j) = Softmax((q_iᵀk_j + β⋅(j-i))/√d + Mask(i, j)) achieve this?
Modeling Arbitrarily Long Sequences with ALiBi
Tuning the ALiBi Bias Scalar (β)
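A minimal worked sketch of the bias comparison in the related question above ("Calculating a Pre-Softmax Attention Score with Positional Bias"). The value of β, the dimension d, and the dot product below are illustrative assumptions; only the positions i = 10, j = 12, and j = 20 come from the question.

import math

beta = -0.5   # fixed negative bias scalar (hypothetical value)
d = 64        # query/key dimension (hypothetical value)
i = 10        # query position from the question
qk = 3.0      # assume identical q.k dot product for both keys

for j in (12, 20):
    bias = beta * (j - i)                # -1.0 for j=12, -5.0 for j=20
    score = (qk + bias) / math.sqrt(d)
    print(f"j={j}: bias={bias}, score={score:.4f}")
# Because beta < 0, the more distant key (j=20) receives the larger negative
# bias and therefore the lower pre-softmax score, i.e. less attention.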