Diagram of the Decoding Phase
The decoding phase of a Transformer, as illustrated in its diagram, generates the output sequence one token at a time. At each step, the model feeds the token produced at the previous step into an embedding layer and derives a new query vector from that embedding. This query attends over an expanding set of keys and values: those computed from the initial prompt during the prefilling phase plus those of all previously generated tokens. The resulting self-attention output is then processed by a Softmax layer to yield the conditional probability of the next token, Pr(y_n | x, y_{<n}). This autoregressive cycle repeats for every new token in the output sequence.
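To make the loop in the diagram concrete, here is a minimal, hypothetical sketch in Python/NumPy. It assumes a toy single-head attention with random stand-in weights; names such as decode_step, W_emb, and W_out are inventions for this example, not anything from the book. A real Transformer stacks many such layers with multiple heads, residual connections, layer norms, and feed-forward blocks; the sketch only shows how each step's query attends over the growing key/value cache and how a final Softmax yields Pr(y_n | x, y_{<n}).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; all weights are random stand-ins for a trained model.
VOCAB, D = 50, 16
W_emb = rng.normal(size=(VOCAB, D))            # embedding table
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))
W_out = rng.normal(size=(D, VOCAB))            # projection to vocabulary logits

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode_step(token_id, K_cache, V_cache):
    """One autoregressive step: the new query attends over all cached K/V."""
    x = W_emb[token_id]                        # embed the previous step's token
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    K = np.vstack([K_cache, k])                # cache grows by one position
    V = np.vstack([V_cache, v])
    attn = softmax(q @ K.T / np.sqrt(D))       # attend to prompt + generated tokens
    h = attn @ V                               # self-attention output
    probs = softmax(h @ W_out)                 # Pr(y_n | x, y_<n) over the vocabulary
    return int(probs.argmax()), K, V           # greedy choice of the next token

# Prefilling: cache K/V for every prompt position except the last;
# the last prompt token is the input to the first decoding step.
prompt = np.array([3, 17, 42])
X = W_emb[prompt[:-1]]
K_cache, V_cache = X @ W_k, X @ W_v

# Decoding: repeat the single-token step autoregressively.
token = prompt[-1]
generated = []
for _ in range(5):
    token, K_cache, V_cache = decode_step(token, K_cache, V_cache)
    generated.append(token)
print(generated)
```

Note that decode_step recomputes nothing for earlier positions: keeping K and V around between steps is exactly the KV-cache idea that several related notes below build on.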
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Logits in Transformer Language Models
Final Hidden States in a Transformer Language Model
Next-Token Probability Calculation in Autoregressive Decoders
Diagram of the Decoding Phase
Diagram of the Transformer Language Model Forward Pass
Diagram of the Autoregressive Generation Architectural Flow
A decoder-only language model generates text one token at a time in a step-by-step process. Arrange the following steps in the correct chronological order for generating a single new token, given an initial prompt and any previously generated tokens.
In the step-by-step generation process of a decoder-only language model, consider a hypothetical modification at generation step i. Instead of using the initial prompt combined with all previously generated tokens as input, the model is only given the initial prompt. What is the most likely consequence of this change on the generated text?
Diagnosing a Generation Failure in a Decoder-Only Model
Single-Step Generation with a KV Cache
Comparison of Prefilling and Decoding Phases
Disaggregation of Prefilling and Decoding using Pipelined Engines
After a large language model processes an initial prompt, it enters a generation stage where it produces the output sequence one token at a time. In each step of this stage, a new query vector is generated for the current position, and it must perform an attention operation over the key-value pairs of the initial prompt plus all the key-value pairs of the tokens generated in previous steps. As the output sequence gets longer, what becomes the most significant performance bottleneck for generating each new token? (See the cost sketch at the end of this list.)
A large language model has finished processing an initial prompt and is about to generate the first token of its response. Arrange the following events in the correct chronological order for this single generation step.
Evaluating an Inference Optimization Proposal
You run an internal LLM inference service for empl...
You’re on-call for an internal LLM chat service. M...
You operate a GPU-backed LLM service that uses con...
Your company’s internal LLM service handles many c...
Evaluating a serving design that combines prefix caching with paged KV memory under mixed prompt lengths
Choosing a KV-cache strategy for shared-prefix traffic under GPU memory pressure
Diagnosing and Redesigning KV-Cache Memory Behavior in a Multi-Tenant LLM Serving Stack
Stabilizing latency and GPU memory in a chat-completions service with shared system prompts
Root-cause and mitigation plan for OOMs and latency spikes during shared-prefix, long-generation traffic
Post-incident analysis: KV-cache growth, fragmentation, and shared-prefix reuse in a streaming LLM service
Decoding Phase Goal Formula
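Several items above turn on the per-step cost of attending over the growing cache, so a back-of-the-envelope sketch may help. All figures below (layer count, head count, head dimension, fp16 storage) are assumptions chosen to resemble a 7B-class model, not numbers from the book.

```python
# Back-of-the-envelope KV-cache traffic per decoding step. The model shape
# below (32 layers, 32 heads, head dim 128, fp16) is a hypothetical
# 7B-class configuration used only for illustration.
layers, heads, d_head = 32, 32, 128
bytes_per_elem = 2                            # fp16 storage

def kv_bytes_read_per_step(seq_len):
    # Each new query must read every cached key and value once:
    # 2 tensors (K and V) x layers x heads x d_head x cached positions.
    return 2 * layers * heads * d_head * seq_len * bytes_per_elem

for n in (1_000, 10_000, 100_000):
    gb = kv_bytes_read_per_step(n) / 1e9
    print(f"{n:>7,} cached tokens -> {gb:6.2f} GB read per step")
```

Under these assumptions, a step at 100,000 cached tokens streams tens of gigabytes from memory while performing roughly one floating-point operation per byte read, far below what GPUs need to stay compute-bound, which is why the decoding phase is characterized as memory-bound.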
Learn After
Decoding Phase as a Memory-Bound Process
Diagram of the N-th Step in Transformer Decoding
A large language model has processed an initial prompt and has just generated the fifth token of its output. As it prepares to generate the sixth token, which of the following statements most accurately describes the function of the self-attention mechanism in this specific step?
A large language model is generating a response one token at a time after processing the initial prompt. Arrange the following actions in the correct sequence to describe how a single new token is generated.
Q, K, and V Composition in Transformer Decoding
Analyzing a Flawed Decoding Step