Mathematical Formulation of Draft Model Prediction in Speculative Decoding
In speculative decoding, the draft model prediction phase starts with a given prefix, denoted as [X, Y_{≤i}] (the original input X followed by the i verified tokens Y_{≤i}). The draft model is used to predict the next K consecutive tokens, represented as ŷ_{i+1}, ŷ_{i+2}, …, ŷ_{i+K}. Generation is a token-by-token process: each new token is chosen by greedily selecting the one with the highest probability under the draft model's distribution Pr_q, conditioned on the prefix and all previously generated draft tokens. This is formally expressed as: ŷ_{i+t} = argmax_y Pr_q(y | X, Y_{≤i}, ŷ_{i+1}, …, ŷ_{i+t-1}).
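The greedy, token-by-token draft loop above can be sketched in Python. This is a minimal illustration, not a specific library's API: `draft_prob` stands in for the draft model's distribution Pr_q, and the function names are hypothetical.

```python
import numpy as np

def draft_generate(draft_prob, context, K):
    """Greedily generate K draft tokens with a draft model.

    draft_prob(context) -> probability vector over the vocabulary,
    conditioned on the full context (prefix plus draft tokens so far).
    Illustrative sketch; names are not from a specific library.
    """
    drafts = []
    ctx = list(context)  # start from the prefix [X, Y_<=i]
    for _ in range(K):
        probs = draft_prob(ctx)        # Pr_q(. | X, Y_<=i, drafts so far)
        y_hat = int(np.argmax(probs))  # greedy pick: highest-probability token
        drafts.append(y_hat)
        ctx.append(y_hat)              # condition the next step on this token
    return drafts
```

Note how each iteration appends the newly chosen token to the context, so ŷ_{i+t} is conditioned on all of ŷ_{i+1}, …, ŷ_{i+t-1}, exactly as in the formal expression.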

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Token Selection from Probability Distribution
Step-by-Step Example of Auto-Regressive Sequence Generation
Iterative Context Update in Autoregressive Generation
Key-Value (KV) Cache in Transformer Inference
Sequential Generation of Output Tokens
Context Shifting in Auto-Regressive Generation
A language model is generating a sentence and has so far produced the sequence:
['The', 'cat', 'sat']. Based on the principles of sequential, one-at-a-time token generation, where each new token depends on the ones before it, what is the direct input the model will use to determine the next token in the sequence?

A language model generates text by producing a single token at each step, using the entire sequence generated so far as the context for the next token. Arrange the following events in the correct chronological order to illustrate the generation of two new tokens following the initial input 'The ocean is'.
A researcher develops a novel text generation model. Given an input like 'The movie was', instead of generating one token at a time, this model predicts the entire completion (e.g., 'incredibly boring and predictable') in a single, parallel step. Which core principle of the standard auto-regressive process is fundamentally violated by this new model's design?
Imagine a text generation system where a small, fast model first generates a short sequence of candidate tokens (e.g., C1, C2, C3). Then, a large, accurate model checks all these candidates at once. Let's say the system has already produced a confirmed sequence of tokens:
['The', 'cat', 'sat']. The small model has just generated two candidate tokens in the current step: ['on', 'the']. What information does the small model use to calculate the probability distribution for the next candidate token (C3)?

In a speculative decoding process, a draft model q generates a sequence of candidate tokens. The probability distribution for the t-th candidate token in the sequence, y_{i+t}, is conditioned on the original input X, the verified token sequence Y_{≤i}, and one other crucial set of tokens. Complete the formal expression for this conditional probability: Pr_q(y_{i+t} | X, Y_{≤i}, ______).

Consider a speculative decoding process where a draft model is generating a sequence of three candidate tokens (ŷ₁, ŷ₂, ŷ₃) after a verified prefix. The probability distribution used to select the third token, ŷ₃, is calculated independently of the first two candidate tokens, ŷ₁ and ŷ₂.
Parallel Verification in Speculative Decoding
Conditional Probability Distribution of the Draft Model in Speculative Decoding
Evaluation of Draft Tokens by the Verification Model
Structure of the Full Sequence After a Speculative Decoding Step
A text generation system uses two models: a small, fast 'draft' model and a large, accurate 'verification' model to speed up output. Arrange the following events to correctly represent one cycle of this generation process, starting from a given text prefix.
A text generation system uses a fast 'draft' model and a more accurate 'verification' model. The draft model proposes the 4-token sequence:
[jumped, over, the, moon]. The verification model then evaluates this sequence and determines that the first two tokens (jumped, over) are correct, but the third token (the) is incorrect. Based on the rules of this generation algorithm, what is the immediate result of this verification step?

Efficiency Limits of a Two-Model Generation System
Diagnosing a Speculative Decoding Slowdown in Production
Choosing τ and Model Roles for Low-Latency Speculative Decoding
Tuning Speculative Decoding Under a Fixed Verification Budget
Interpreting a Speculative Decoding Trace and Identifying the Bottleneck
Explaining a “Fast but Wrong” Speculative Decoding Regression
Root-Causing Low Speedup Despite Parallel Verification
Designing a Speculative Decoding Control Policy for a Latency-Sensitive Product
In a production LLM service using speculative deco...
You are reviewing logs from a production LLM endpo...
You are implementing speculative decoding in a cus...
Learn After
Example of Draft Token Generation in Speculative Decoding
A system uses a fast draft model to autoregressively generate a sequence of several candidate tokens from a given prefix. The model generates these candidates one by one, and for each step, it greedily selects the token with the highest probability according to its own distribution,
Pr_q. If the system is in the process of generating the third candidate token in the sequence, ŷ_{i+3}, which of the following represents the correct set of information the draft model's probability distribution must be conditioned on for this specific step?

A developer is implementing the draft token generation phase of a text generation system. The system is designed to autoregressively produce a short sequence of candidate tokens at each step. The developer's code for generating the third token in a sequence, ŷ_{i+3}, incorrectly conditions the draft model's probability distribution only on the initial prefix [X, y_{≤i}] and the first candidate token ŷ_{i+1}, omitting the second candidate token ŷ_{i+2} from the context. What is the most likely consequence of this specific error?

A fast, approximate language model is tasked with generating a sequence of three candidate tokens (ŷᵢ₊₁, ŷᵢ₊₂, ŷᵢ₊₃) starting from a given text prefix P. Arrange the following actions in the correct chronological order to describe how this sequence is produced.
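The verification step described above (accept draft tokens up to the first mismatch, then substitute the verification model's own token) can be sketched as follows. This shows only the greedy-matching variant implied by the scenario; the names `verify_prob` and `verify_drafts` are illustrative, and probabilistic acceptance schemes used in practice are not shown.

```python
import numpy as np

def verify_drafts(verify_prob, context, drafts):
    """Check draft tokens against the verification model's greedy choices.

    Accepts drafts up to the first mismatch; at the mismatch, the verification
    model's own token replaces the rejected draft and the rest are discarded.
    Illustrative sketch (greedy-matching variant only).
    """
    accepted = []
    ctx = list(context)  # the verified prefix so far
    for d in drafts:
        target = int(np.argmax(verify_prob(ctx)))  # large model's greedy token
        if d == target:
            accepted.append(d)       # draft confirmed, keep going
            ctx.append(d)
        else:
            accepted.append(target)  # replace the rejected draft, stop here
            break
    return accepted
```

In the [jumped, over, the, moon] scenario, this returns [jumped, over, &lt;verifier's token&gt;]: the first two drafts are kept, the third is replaced, and the fourth is thrown away.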