Learn Before
In a Transformer decoder, masked self-attention ensures that the prediction for a token at a given position can depend only on that position and the positions before it. This is achieved by modifying the attention score matrix before the softmax function is applied. For a sequence of tokens, which of the following correctly describes the structure of the attention score matrix after this causal mask has been applied?
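For reference, the masked score matrix is effectively lower-triangular: every entry above the main diagonal is replaced with negative infinity, so the row-wise softmax assigns those future positions exactly zero weight. A minimal NumPy sketch of this step (variable names such as `scores` and `seq_len` are illustrative assumptions, not code from the course):

```python
import numpy as np

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))  # raw query-key dot products

# Causal mask: every entry above the diagonal (a future position) becomes -inf.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

# Row-wise softmax: -inf entries receive exactly zero weight, so each row i
# distributes attention only over positions j <= i.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 3))  # zero above the diagonal, each row sums to 1
```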
Tags
Data Science
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An autoregressive model is generating a sequence of text token by token. When it is time to predict the token at position 't', the model's attention mechanism calculates relevance scores between the query at position 't' and the keys at all positions in the sequence. However, a crucial modification is applied that prevents the query at 't' from incorporating information from any keys at positions greater than 't' (i.e., t+1, t+2, etc.). Which statement best analyzes the fundamental reason for this specific modification?
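The reason hinges on causality: at generation time the tokens after position 't' do not yet exist, so training must prevent the model from conditioning on them, or the training and inference conditions would diverge. A hedged NumPy sketch (the helper `causal_attention` is hypothetical, not from the book) demonstrating that the masked output at a position is unaffected by later tokens:

```python
import numpy as np

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask (illustrative sketch)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(1)
q, k, v = (rng.normal(size=(5, 8)) for _ in range(3))
out = causal_attention(q, k, v)

# Replace the keys/values at the *future* positions 3 and 4; the outputs at
# positions 0-2 are identical, because the mask makes position t independent
# of everything after t.
k2, v2 = k.copy(), v.copy()
k2[3:] = rng.normal(size=(2, 8))
v2[3:] = rng.normal(size=(2, 8))
out2 = causal_attention(q, k2, v2)
print(np.allclose(out[:3], out2[:3]))  # True
```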
A Transformer decoder is calculating its output for a specific token in a sequence. To ensure it only uses information from that token and previous tokens, it employs a special attention mechanism. Arrange the following five operations in the correct chronological order as they would occur within this mechanism.
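For reference, one conventional ordering of these steps in masked scaled dot-product self-attention, sketched in NumPy (the question's five operations may be phrased differently; variable names here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
d_k = 8
Q, K, V = (rng.normal(size=(6, d_k)) for _ in range(3))

scores = Q @ K.T                       # 1. query-key dot products
scores /= np.sqrt(d_k)                 # 2. scale by sqrt(d_k)
scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf  # 3. apply causal mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # 4. row-wise softmax
output = weights @ V                   # 5. weighted sum of value vectors
print(output.shape)                    # (6, 8)
```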