Optimizing Inference Throughput
Based on the architectural principle of separating the distinct computational phases of inference, propose a change to the team's batch processing logic to improve GPU utilization. Explain why your proposed change would be effective.
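One change of the kind this question points toward is continuous (in-flight) batching, where prompt prefill and token-by-token decode are scheduled separately so that finished requests free their batch slots immediately and the decode batch stays full. The sketch below is a minimal, illustrative simulation of that scheduling idea only; the names (Request, Scheduler, run_prefill, run_decode_step, max_batch_size) are hypothetical and do not refer to any particular inference library, and the model calls are stubbed out.

```python
# Hypothetical sketch of a continuous-batching scheduler that keeps prefill and
# decode work separate, so the GPU can run large decode batches while newly
# arrived prompts are prefilled as slots open up. Names are illustrative.
from dataclasses import dataclass, field
from collections import deque


@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)


class Scheduler:
    def __init__(self, max_batch_size: int = 8):
        self.max_batch_size = max_batch_size
        self.waiting = deque()   # prompts not yet prefilled
        self.running = []        # requests currently in the decode phase

    def add(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self) -> None:
        # Admit new requests: prefill fills the batch up to capacity.
        while self.waiting and len(self.running) < self.max_batch_size:
            req = self.waiting.popleft()
            self.run_prefill(req)          # compute-bound: whole prompt at once
            self.running.append(req)

        # One decode step for the whole batch: decode is memory-bound, so
        # batching many requests together raises arithmetic intensity.
        if self.running:
            self.run_decode_step(self.running)

        # Retire finished requests immediately so their slots are reused on
        # the next step instead of waiting for the entire batch to drain.
        self.running = [r for r in self.running
                        if len(r.generated) < r.max_new_tokens]

    # Placeholders standing in for real model calls.
    def run_prefill(self, req: Request) -> None:
        pass

    def run_decode_step(self, batch: list) -> None:
        for req in batch:
            req.generated.append("<tok>")


if __name__ == "__main__":
    sched = Scheduler(max_batch_size=4)
    for i in range(6):
        sched.add(Request(prompt=f"prompt {i}", max_new_tokens=3 + i % 2))
    while sched.waiting or sched.running:
        sched.step()
```

The design choice this illustrates: because slots are recycled per decode step rather than per batch, the decode batch stays near capacity and the memory-bound decode phase wastes fewer GPU cycles waiting on stragglers.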
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Continuous Batching for LLM Inference
In a common architecture for language model inference, the initial processing of a user's prompt (prefilling) and the subsequent token-by-token generation of the response (decoding) are treated as distinct computational stages, even though they execute on the same hardware. What is the primary analytical reason for this architectural separation?
Optimizing Inference Throughput
Trade-offs in a Staged Inference Architecture