Essay

Choosing τ and Model Roles for Low-Latency Speculative Decoding

You are deploying speculative decoding for a customer-facing chat product with a strict p95 latency SLO. The system uses a small, fast draft model to propose τ tokens autoregressively, then a large verification model to evaluate all τ proposed tokens in one parallel forward pass. After verification, only the maximum consecutively accepted prefix of the τ tokens is appended; at the first rejected token, the remaining draft tokens are discarded and the verification model generates the next token to continue.
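The drafting/verification loop described above can be sketched in a few lines. This is a minimal toy model, not a real implementation: `draft_next` and `verify_next` are hypothetical stand-ins for the two models' greedy next-token functions, and the "one parallel forward pass" is simulated by scoring each drafted position in a loop.

```python
def speculative_step(draft_next, verify_next, prefix, tau):
    """One speculative decoding step (toy sketch).

    1. The draft model proposes tau tokens autoregressively.
    2. The verification model scores all tau positions (in a real system,
       one parallel forward pass; simulated sequentially here).
    3. Only the maximum consecutively accepted prefix is kept; at the
       first mismatch the remaining draft tokens are discarded and the
       verifier's own token is emitted to continue the sequence.

    Returns (tokens_emitted, num_draft_tokens_accepted).
    """
    # 1. Autoregressive drafting with the small model.
    draft, ctx = [], list(prefix)
    for _ in range(tau):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2 + 3. Verification and maximum consecutively accepted prefix.
    emitted, ctx = [], list(prefix)
    for t in draft:
        v = verify_next(ctx)  # what the large model would emit here
        if v == t:
            emitted.append(t)
            ctx.append(t)
        else:
            emitted.append(v)  # verifier supplies the continuation token
            return emitted, len(emitted) - 1
    # All tau tokens accepted: the verifier still contributes one token.
    emitted.append(verify_next(ctx))
    return emitted, tau
```

Note how the rejection branch is where a large τ hurts: all drafting work past the first mismatch is discarded, so the cost of τ draft steps is paid even when only a short prefix survives.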

In a recent A/B test, increasing τ from 4 to 16 reduced the number of verification forward passes per response, yet p95 latency worsened and output quality became less stable (more abrupt shifts in tone mid-sentence). Write a recommendation memo that (1) explains, using the interaction between the draft model, the verification model, parallel verification, and the “maximum consecutively accepted tokens” rule, how a larger τ can simultaneously reduce the verification-call count yet worsen tail latency and perceived quality; and (2) proposes a concrete policy for choosing τ (or adapting it online) that explicitly accounts for draft accuracy, the cost of a verification pass, and the expected length of the consecutively accepted prefix. The memo should make clear which signals you would monitor in production and which trade-offs your policy optimizes.
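Requirement (2) can be made concrete with a toy cost model. Suppose each drafted token is accepted independently with probability p (an i.i.d. simplification; real acceptance rates drift with context), a draft step costs c_draft, and a verification pass costs c_verify. Then each step emits the expected accepted prefix plus the verifier's one continuation token, and τ can be chosen to minimize expected wall-clock time per emitted token. The function names and cost constants below are illustrative assumptions, not part of the prompt:

```python
def expected_accepted(tau, p):
    """Expected length of the maximum consecutively accepted prefix when
    each of tau drafted tokens is accepted i.i.d. with probability p:
    sum_{k=1..tau} p^k = p * (1 - p^tau) / (1 - p)."""
    if p >= 1.0:
        return float(tau)
    return p * (1.0 - p ** tau) / (1.0 - p)

def best_tau(p, c_draft, c_verify, tau_max=16):
    """Pick tau minimizing expected time per emitted token.

    Each step costs tau * c_draft + c_verify and emits
    expected_accepted(tau, p) + 1 tokens (the +1 is the verifier's own
    token at the first rejection or after a full acceptance). All cost
    constants are assumptions to be replaced by production measurements.
    """
    def time_per_token(tau):
        return (tau * c_draft + c_verify) / (expected_accepted(tau, p) + 1.0)
    return min(range(1, tau_max + 1), key=time_per_token)
```

The shape of the optimum matches the A/B result: because the expected accepted prefix saturates geometrically in τ while drafting cost grows linearly, pushing τ from 4 to 16 buys almost no extra accepted tokens but pays for every extra draft step, so time per emitted token (and hence tail latency) can rise even as verification-call count falls. An online policy would re-estimate p from the observed accepted-prefix lengths and recompute τ periodically.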

Updated 2026-02-06

Tags

Ch.5 Inference - Foundations of Large Language Models

Computing Sciences
