Learn Before
Conditional Probability Distribution of the Draft Model in Speculative Decoding
In speculative decoding, the draft model, denoted by q, defines a conditional probability distribution for generating the next token. The probability of any candidate token y_{i+t} is conditioned on the original input X, the sequence of already verified tokens Y_{≤i}, and all previously generated draft tokens in the current step, ŷ_{i+1}...ŷ_{i+t-1}. This distribution is formally expressed as Pr_q(y_{i+t} | X, Y_{≤i}, ŷ_{i+1}...ŷ_{i+t-1}).
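The conditioning described above can be sketched in code. This is a minimal toy illustration, not a specific library's API: `toy_draft_logits` stands in for the draft model q, and the sampling loop shows how each draft token is drawn from a distribution conditioned on the input, the verified prefix, and the draft tokens generated so far in the current step. All names here are illustrative.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]

def toy_draft_logits(context_ids):
    # Stand-in for the draft model q: logits depend on the full context,
    # including any draft tokens already appended in this step.
    return [float((tok + sum(context_ids)) % 5) for tok in range(len(VOCAB))]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def generate_draft_tokens(context_ids, k, rng):
    ids = list(context_ids)  # encodes X and the verified tokens Y_<=i
    drafts = []
    for _ in range(k):
        # Pr_q(y_{i+t} | X, Y_<=i, y-hat_{i+1}...y-hat_{i+t-1})
        probs = softmax(toy_draft_logits(ids))
        r, acc, next_id = rng.random(), 0.0, len(VOCAB) - 1
        for j, p in enumerate(probs):
            acc += p
            if r < acc:
                next_id = j
                break
        drafts.append(next_id)
        ids.append(next_id)  # earlier draft tokens condition later ones
    return drafts

drafts = generate_draft_tokens([0, 1, 2], k=3, rng=random.Random(0))
```

Note that `ids.append(next_id)` inside the loop is what makes the distribution for the t-th candidate depend on ŷ_{i+1}...ŷ_{i+t-1}: each draft token is fed back into the context before the next one is sampled.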

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Parallel Verification in Speculative Decoding
Mathematical Formulation of Draft Model Prediction in Speculative Decoding
Evaluation of Draft Tokens by the Verification Model
Structure of the Full Sequence After a Speculative Decoding Step
A text generation system uses two models: a small, fast 'draft' model and a large, accurate 'verification' model to speed up output. Arrange the following events to correctly represent one cycle of this generation process, starting from a given text prefix.
A text generation system uses a fast 'draft' model and a more accurate 'verification' model. The draft model proposes the 4-token sequence [jumped, over, the, moon]. The verification model then evaluates this sequence and determines that the first two tokens (jumped, over) are correct, but the third token (the) is incorrect. Based on the rules of this generation algorithm, what is the immediate result of this verification step?
Efficiency Limits of a Two-Model Generation System
Diagnosing a Speculative Decoding Slowdown in Production
Choosing τ and Model Roles for Low-Latency Speculative Decoding
Tuning Speculative Decoding Under a Fixed Verification Budget
Interpreting a Speculative Decoding Trace and Identifying the Bottleneck
Explaining a “Fast but Wrong” Speculative Decoding Regression
Root-Causing Low Speedup Despite Parallel Verification
Designing a Speculative Decoding Control Policy for a Latency-Sensitive Product
Learn After
Mathematical Formulation of Draft Model Prediction in Speculative Decoding
Imagine a text generation system where a small, fast model first generates a short sequence of candidate tokens (e.g., C1, C2, C3). Then, a large, accurate model checks all these candidates at once. Let's say the system has already produced a confirmed sequence of tokens: ['The', 'cat', 'sat']. The small model has just generated two candidate tokens in the current step: ['on', 'the']. What information does the small model use to calculate the probability distribution for the next candidate token (C3)?
In a speculative decoding process, a draft model q generates a sequence of candidate tokens. The probability distribution for the t-th candidate token in the sequence, y_{i+t}, is conditioned on the original input X, the verified token sequence Y_{≤i}, and one other crucial set of tokens. Complete the formal expression for this conditional probability: Pr_q(y_{i+t} | X, Y_{≤i}, ______).
Consider a speculative decoding process where a draft model is generating a sequence of three candidate tokens (ŷ₁, ŷ₂, ŷ₃) after a verified prefix. The probability distribution used to select the third token, ŷ₃, is calculated independently of the first two candidate tokens, ŷ₁ and ŷ₂.