Learn Before
In a self-attention mechanism designed for generating text one token at a time, the calculation for a token at a specific position must only depend on the tokens that came before it and the token at the current position. For a sequence of 5 tokens (indexed 0 to 4), which of the following dot product calculations between a query vector (q) and a key vector (k) would be disallowed to maintain this property?
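For reference, the constraint described in the question can be illustrated with a minimal sketch, assuming the standard autoregressive (causal) mask in which the query at position i may only be dotted with keys at positions 0 through i. The variable names and the 5-token sequence length below are chosen to match the question, not taken from any particular implementation.

# Minimal sketch of a causal (autoregressive) attention mask for 5 tokens.
# Assumption: the query at position i may only attend to keys at positions j <= i,
# so any dot product q_i . k_j with j > i is disallowed.

num_tokens = 5  # positions 0..4, matching the question

for i in range(num_tokens):          # query position
    for j in range(num_tokens):      # key position
        status = "allowed" if j <= i else "disallowed"
        print(f"q{i} . k{j}: {status}")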
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
In a self-attention mechanism where the output for any given position can only depend on inputs at the current and preceding positions, consider a sequence of 8 tokens (indexed 0 to 7). The query vector for the final token in the sequence will be multiplied with a total of ___ key vectors.
In a language model that generates text sequentially, the attention mechanism ensures that the prediction for a token only depends on the tokens that have come before it, including itself. For a sequence of 6 tokens (indexed 0 to 5), which of the following lists represents the complete set of dot products that must be computed for the query vector at position 3 (q₃)?
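The two related items above follow from the same counting rule; the sketch below works it through. It is a hedged illustration only: the helper name causal_key_positions is hypothetical, and the convention that the query at position i is dotted with keys at positions 0 through i (giving i + 1 dot products) is the standard causal-masking assumption, not something quoted from the course material.

# Sketch of how many key vectors a causal query multiplies, and which ones.
# Assumption: standard causal masking, i.e. the query at position i attends to keys 0..i.
# The function name is hypothetical, for illustration only.

def causal_key_positions(query_pos):
    """Key positions that the query at `query_pos` is dotted with under a causal mask."""
    return list(range(query_pos + 1))

# 8-token sequence (positions 0..7): the final query q7 is multiplied with this many keys.
print(len(causal_key_positions(7)))                      # -> 8

# 6-token sequence (positions 0..5): the dot products computed for q3.
print([f"q3 . k{j}" for j in causal_key_positions(3)])   # -> q3.k0, q3.k1, q3.k2, q3.k3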