
Learn Before

Attention Score in Transformers ($\beta_{i,j}$)

Multiple Choice

In a transformer model designed for text generation, a causal mask is applied to the attention scores ($\beta_{i,j}$) so that the token at position $i$ cannot attend to future tokens (positions $j > i$). The mask is added to the scores before softmax normalization: masked entries receive $-\infty$ (in practice, a very large negative number), so their attention weights become zero. Consider the attention scores for a sequence of 4 tokens. Which of the following matrices correctly represents the application of this causal mask, where 'Score' indicates a calculated value and '-∞' indicates a masked value?
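
The masking rule described in the question can be sketched in a few lines of plain Python. This is an illustrative sketch, not part of the original question: the helper names (`causal_mask`, `masked_softmax`, `show_pattern`) are hypothetical, and it assumes the common convention that rows index the attending position $i$ and columns index the attended position $j$.

```python
import math


def causal_mask(n):
    """Additive mask: 0.0 where j <= i (allowed), -inf where j > i (future)."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]


def masked_softmax(scores):
    """Add the causal mask to the raw scores, then apply softmax to each row."""
    n = len(scores)
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        row = [scores[i][j] + mask[i][j] for j in range(n)]
        row_max = max(row)  # always finite: position j == i is never masked
        exps = [math.exp(x - row_max) for x in row]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def show_pattern(n):
    """Label each cell of the masked score matrix as 'Score' or '-inf'."""
    return [["Score" if v == 0.0 else "-inf" for v in row] for row in causal_mask(n)]


pattern = show_pattern(4)
for row in pattern:
    print(row)
```

Under the assumed row and column convention, the visible scores fill the diagonal and everything below it, and the masked entries fill the region above the diagonal. Passing any 4 x 4 score matrix to `masked_softmax` then gives rows that each sum to 1, with zero weight on every future position.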

Updated 2025-10-04
