Multiple Choice

An autoregressive model is generating a sequence of text token by token. When it comes time to predict the token at position t, the model's attention mechanism computes relevance scores between the query at position t and the keys at every position in the sequence. However, a crucial modification is applied that prevents the query at t from incorporating information from any keys at positions greater than t (i.e., positions t+1, t+2, and so on). Which statement best analyzes the fundamental reason for this specific modification?
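For readers who want to see the mechanism described above concretely, here is a minimal sketch of causal (masked) self-attention in NumPy. The function name causal_attention, the single-head unbatched shapes, and the variable names are illustrative assumptions, not something specified in the question.

import numpy as np

def causal_attention(Q, K, V):
    # Q, K, V: arrays of shape (seq_len, d_k). Scores are first computed
    # against every position, then positions greater than t are masked
    # out before the softmax so row t cannot attend to the future.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (seq_len, seq_len) relevance scores
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)     # -inf becomes weight 0 after softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # row t mixes only values at positions <= t

# Illustrative usage (random inputs):
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
out = causal_attention(Q, K, V)

With the mask in place, changing the inputs at positions greater than t leaves the output at position t unchanged, which is the behavior the question asks you to explain.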



Tags: Data Science, Computing Sciences, Foundations of Large Language Models, Foundations of Large Language Models Course, Ch.5 Inference - Foundations of Large Language Models, Analysis in Bloom's Taxonomy, Cognitive Psychology, Psychology, Social Science, Empirical Science, Science