Learn Before
Q, K, and V Composition in Transformer Decoding
In the step-by-step decoding process of a Transformer, the self-attention mechanism uses distinct sets of Query (Q), Key (K), and Value (V) vectors. At each decoding step, a new query vector is computed from the embedding of the most recent token; that token's key and value vectors are computed at the same time and appended to the cache. The query then attends over this cumulative set of key-value pairs: all pairs from the initial prompt (processed during the prefilling phase) plus all pairs from the tokens generated in earlier decoding steps, up to and including the current position.
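As an illustration, here is a minimal single-head NumPy sketch of this process. The projection matrices (W_q, W_k, W_v), the dimension d, and the random embeddings are invented for the example; positional encodings, multiple heads, and batching are omitted. It is a sketch of the Q/K/V composition described above, not a full implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension (illustrative)

# Hypothetical projection matrices for a single attention head.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def prefill(prompt_embeddings):
    """Process the whole prompt once, caching one K and V row per prompt token."""
    return prompt_embeddings @ W_k, prompt_embeddings @ W_v

def decode_step(x_new, K_cache, V_cache):
    """One decoding step: a fresh query attends over all cached K/V pairs."""
    q = x_new @ W_q                               # new query from the newest token only
    K_cache = np.vstack([K_cache, x_new @ W_k])   # append this token's key...
    V_cache = np.vstack([V_cache, x_new @ W_v])   # ...and its value to the cache
    scores = K_cache @ q / np.sqrt(d)             # attend over prompt + generated tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over all cached positions
    return weights @ V_cache, K_cache, V_cache    # weighted sum of all cached values

# Usage: prefill a 5-token prompt, then run three decoding steps.
K, V = prefill(rng.normal(size=(5, d)))
for _ in range(3):
    out, K, V = decode_step(rng.normal(size=(d,)), K, V)
print(K.shape)  # (8, 8): 5 prompt keys + 3 generated-token keys
```

Note how the cache only ever grows: each step appends exactly one key-value pair, while the query is rebuilt from scratch from the newest token, which is why only K and V are cached.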
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Decoding Phase as a Memory-Bound Process
Diagram of the N-th Step in Transformer Decoding
A large language model has processed an initial prompt and has just generated the fifth token of its output. As it prepares to generate the sixth token, which of the following statements most accurately describes the function of the self-attention mechanism in this specific step?
A large language model is generating a response one token at a time after processing the initial prompt. Arrange the following actions in the correct sequence to describe how a single new token is generated.
Q, K, and V Composition in Transformer Decoding
Analyzing a Flawed Decoding Step
Learn After
An autoregressive language model is generating a sequence one token at a time. It has already processed the initial input 'The cat sat on the' and has subsequently generated the tokens 'mat and'. The model is now in the process of generating the token that will follow 'and'. What set of key and value vectors will the new query vector for this step attend to?
Consider a language model generating a sequence of text one token at a time after being given an initial prompt. For the generation of the tenth token in the output sequence, the newly created query vector will attend to a set of key and value vectors derived only from the nine previously generated tokens.
Dynamic K/V Cache in Transformer Decoding