Activity (Process)

Diagram of the Decoding Phase

The decoding phase of a Transformer, as illustrated in the diagram, generates output one token at a time. At each step, the model feeds the token produced at the previous step through the embedding layer and uses the result to form a new query vector. This query attends to a growing set of keys and values: those computed from the initial prompt during the prefilling phase, together with those of all previously generated tokens. The output of the self-attention mechanism is passed through a Softmax layer to yield the conditional probability of the next token, Pr(y_n | x, y_<n). This autoregressive cycle repeats for each new token in the output sequence.
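The step-by-step process described above can be sketched in a few lines of Python. This is a toy illustration, not a real model: the embedding and output matrices are random, a single attention head stands in for the full Transformer stack, and the keys and values are taken directly from the embeddings. What it does show faithfully is the autoregressive loop: each step builds a query from the previous token, attends to a cache of keys and values that starts with the prompt (prefilling) and grows with every generated token, and applies a Softmax to obtain Pr(y_n | x, y_<n).

```python
import math
import random

random.seed(0)
VOCAB, DIM = 10, 4

def rand_matrix(rows, cols):
    # Random stand-in for learned parameters.
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

EMBED = rand_matrix(VOCAB, DIM)   # token id -> embedding vector
W_OUT = rand_matrix(DIM, VOCAB)   # hidden state -> vocabulary logits

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decode(prompt, steps):
    # Prefilling: cache keys/values for every prompt token.
    keys = [EMBED[t] for t in prompt]
    values = [EMBED[t] for t in prompt]
    out = []
    token = prompt[-1]
    for _ in range(steps):
        q = EMBED[token]  # query built from the previous step's token
        # Attend over all cached keys (prompt + generated so far).
        weights = softmax([dot(q, k) / math.sqrt(DIM) for k in keys])
        ctx = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(DIM)]
        # Softmax over the vocabulary: Pr(y_n | x, y_<n).
        probs = softmax([dot(ctx, col) for col in zip(*W_OUT)])
        token = max(range(VOCAB), key=probs.__getitem__)  # greedy pick
        out.append(token)
        # The cache expands by one key/value pair per generated token.
        keys.append(EMBED[token])
        values.append(EMBED[token])
    return out

print(decode([1, 2, 3], 5))
```

In a real implementation the growing key/value set is the KV cache, which lets each decoding step reuse past computation instead of re-encoding the whole sequence.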


Updated 2026-05-02


Tags: Ch.5 Inference - Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences
