In a common architecture for language model inference, the initial processing of a user's prompt (prefilling) and the subsequent token-by-token generation of the response (decoding) are treated as distinct computational stages, even though they execute on the same hardware. What is the primary analytical reason for this architectural separation?
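To make the asymmetry behind the question concrete, here is a minimal sketch (a toy stand-in, not any real inference engine): prefill runs one parallel forward pass over every prompt position at once, while decoding runs many tiny sequential passes of a single position each, re-reading the growing KV cache every step. The function names and the list-based "KV cache" are illustrative assumptions.

```python
# Toy sketch of the two-stage inference pattern the question refers to.
# No real model: the point is only the shape of the computation.

def prefill(prompt_tokens):
    """Process the whole prompt in one forward pass.
    All positions are computed in parallel, so arithmetic intensity is
    high (compute-bound on real hardware). Returns the KV cache."""
    kv_cache = list(prompt_tokens)  # stand-in for cached keys/values
    return kv_cache

def decode_step(kv_cache, token):
    """Generate one token: a forward pass over a single new position
    that must read the entire KV cache (memory-bandwidth-bound on
    real hardware)."""
    kv_cache.append(token)
    return token + 1  # toy "next token" rule for illustration

def generate(prompt_tokens, steps):
    kv_cache = prefill(prompt_tokens)   # one big parallel pass
    token = prompt_tokens[-1]
    out = []
    for _ in range(steps):              # many small sequential passes
        token = decode_step(kv_cache, token)
        out.append(token)
    return out

print(generate([1, 2, 3], 4))  # → [4, 5, 6, 7]
```

The sketch shows why the stages are analyzed separately: the same hardware sees one large batched matrix computation during prefill, then a long sequence of small cache-bound steps during decoding, so their performance bottlenecks differ.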
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Continuous Batching for LLM Inference
Optimizing Inference Throughput
Trade-offs in a Staged Inference Architecture