Trade-offs in a Staged Inference Architecture
An inference system processes user prompts by first computing the initial state (the attention key-value cache) for a whole batch of requests (the 'prefill' stage), and only then generating responses token by token for that same batch (the 'decoding' stage). Describe one major efficiency benefit and one potential drawback of this design, in which the entire batch must complete the first stage before the second stage begins.
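One way to make the trade-off concrete is to simulate the staging directly. The following is a minimal Python sketch, not a real inference engine: the `Request` type and the `prefill` and `decode_step` functions are hypothetical stand-ins invented for illustration, and the "cache" and "sampling" are toys. It prefills the whole batch in one pass, then decodes the batch in lockstep.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: list[int]                                  # prompt token ids
    max_new_tokens: int
    kv_cache: list[int] = field(default_factory=list)  # stand-in for attention KV state
    output: list[int] = field(default_factory=list)    # generated token ids

def prefill(batch: list[Request]) -> None:
    """Stage 1: process every prompt in one pass. All prompt tokens are
    attended to at once, so the hardware stays busy (compute-bound)."""
    for req in batch:
        req.kv_cache = list(req.prompt)  # toy: the 'cache' is just the tokens seen

def decode_step(batch: list[Request]) -> None:
    """Stage 2: emit exactly one token per unfinished request. Each step
    re-reads the whole cache to add a single token (bandwidth-bound)."""
    for req in batch:
        if len(req.output) < req.max_new_tokens:
            next_token = (req.kv_cache[-1] + 1) % 50_000  # toy stand-in for sampling
            req.kv_cache.append(next_token)
            req.output.append(next_token)

def run_staged(batch: list[Request]) -> None:
    prefill(batch)  # the whole batch must finish prefill before any decoding
    for _ in range(max(r.max_new_tokens for r in batch)):
        decode_step(batch)  # lockstep: short responses wait for the longest one

if __name__ == "__main__":
    batch = [Request(prompt=[1, 2, 3], max_new_tokens=2),
             Request(prompt=[7, 8], max_new_tokens=5)]
    run_staged(batch)
    for i, req in enumerate(batch):
        print(f"request {i}: {len(req.output)} tokens generated")
```

The lockstep loop at the end is the drawback in miniature: request 0 finishes after two steps, but its slot sits idle until request 1 completes step five. Continuous batching (see the related card below) targets exactly this idle time.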
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Continuous Batching for LLM Inference
In a common architecture for language model inference, the initial processing of a user's prompt (prefilling) and the subsequent token-by-token generation of the response (decoding) are treated as distinct computational stages, even though they execute on the same hardware. What is the primary analytical reason for this architectural separation? (A back-of-the-envelope sketch of this contrast follows this list.)
Optimizing Inference Throughput
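For the continuous-batching question above, the separation is usually explained through arithmetic intensity: prefill pushes many tokens through each load of the weights, while decode pushes one token per request. The sketch below works through that ratio; `d_model`, the batch size, and the prompt length are assumed, illustrative numbers, not taken from any particular model.

```python
def matmul_flops(m: int, k: int, n: int) -> float:
    """FLOPs for an (m, k) x (k, n) matrix multiply."""
    return 2.0 * m * k * n

d_model = 4096               # assumed hidden size, chosen only for illustration
bytes_per_weight = 2         # fp16 weights
weight_bytes = d_model * d_model * bytes_per_weight  # one square projection's weights

def arithmetic_intensity(tokens: int) -> float:
    """FLOPs per byte of weight traffic when `tokens` rows share one weight load."""
    return matmul_flops(tokens, d_model, d_model) / weight_bytes

prefill_tokens = 32 * 512    # assumed: batch of 32 prompts, 512 tokens each
decode_tokens = 32           # decode: one new token per request per step

print(f"prefill: {arithmetic_intensity(prefill_tokens):,.0f} FLOPs per weight byte")
print(f"decode:  {arithmetic_intensity(decode_tokens):,.0f} FLOPs per weight byte")
```

The gap equals the prompt length (512x with these numbers): prefill amortizes each weight read across every prompt token and can saturate compute, while decode re-reads the same weights to produce a single token per request and stalls on memory bandwidth. That contrast is the usual analytical reason the two stages are scheduled and provisioned separately.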