In a self-attention mechanism where the prediction for a token at position i can only depend on tokens from positions 0 up to and including i, what is the total number of query-key dot products computed for an entire input sequence of 5 tokens (indexed 0 to 4)?
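A quick way to check the count: under the causal constraint, the query at position i is dotted with the keys at positions 0 through i, contributing i + 1 products, so a 5-token sequence yields 1 + 2 + 3 + 4 + 5 = 15. The short sketch below (plain Python, not part of the original card) computes this:

# Count the causal query-key dot products for n = 5 tokens.
# The query at position i is dotted with keys k_0 .. k_i,
# contributing i + 1 products.
n = 5
total = sum(i + 1 for i in range(n))  # 1 + 2 + 3 + 4 + 5
print(total)  # 15, equivalently n * (n + 1) // 2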
Tags
- Ch.2 Generative Models - Foundations of Large Language Models (Foundations of Large Language Models Course / Computing Sciences)
- Application in Bloom's Taxonomy (Cognitive Psychology / Psychology / Social Science / Empirical Science / Science)
Related
Explicit Enumeration of Causal Self-Attention Dot Products
- An autoregressive model processes a sequence of tokens, where the query for a given token i (denoted q_i) can only interact with key vectors from positions j where j ≤ i. For the 4th token in a sequence (indexed as 3), which of the following dot product computations would not be performed?
- An autoregressive model processes a sequence of tokens, where the query for a given token i (denoted q_i) can only interact with key vectors from positions j where j ≤ i. Match each query vector with the complete set of dot products it computes.
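For the matching question above, a small sketch (assumed Python; the q/k labels are illustrative, not from the card) that lists every dot product each query computes under the constraint j ≤ i:

# Enumerate the causal dot products explicitly for n = 5 tokens:
# query q_i pairs only with keys k_j for j <= i.
n = 5
for i in range(n):
    print(f"q{i}: " + ", ".join(f"q{i}.k{j}" for j in range(i + 1)))
# q0: q0.k0
# q1: q1.k0, q1.k1
# q2: q2.k0, q2.k1, q2.k2
# q3: q3.k0, q3.k1, q3.k2, q3.k3
# q4: q4.k0, q4.k1, q4.k2, q4.k3, q4.k4   (15 dot products in all)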