Deconstructing the High Cost of Autoregressive Decoding
While sequential, one-token-at-a-time generation is a well-known characteristic of the decoding phase in large language models, it is not the sole reason decoding is so much more computationally expensive than the prefilling phase. Analyze the underlying factors that make the decoding phase a significant computational bottleneck. In your analysis, distinguish between the impact of memory bandwidth limitations and the nature of the computations being performed.
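One way to make the contrast concrete is a rough arithmetic-intensity estimate. The sketch below is a simplified, back-of-the-envelope model with illustrative, assumed dimensions (e.g. a hypothetical 32-layer model with d_model = 4096 and a 1024-token prompt), not a measurement of any real system or a method from the course text. It compares the FLOPs performed against the bytes of weight and KV-cache traffic for a prefill pass over a full prompt versus a single decode step: the decode step performs only a few FLOPs per byte moved, far below what modern accelerators need to stay compute-bound, which is why decoding tends to be limited by memory bandwidth rather than raw arithmetic.

```python
# A minimal back-of-the-envelope sketch (not tied to any particular model or framework)
# contrasting the arithmetic intensity (FLOPs per byte of memory traffic) of the
# prefill and decode phases of a decoder-only transformer. All dimensions below are
# illustrative assumptions.

def phase_stats(d_model: int, n_layers: int, n_tokens: int, kv_len: int,
                bytes_per_param: int = 2) -> tuple[float, float, float]:
    """Rough FLOPs, bytes moved, and FLOPs/byte for one forward pass.

    n_tokens: tokens processed in parallel in this pass
              (the whole prompt for prefill, a single token for decode).
    kv_len:   sequence length attended over (the KV-cache length).
    """
    # Roughly 12 * d_model^2 weight parameters per layer (QKV, output, two MLP projections).
    weight_params = n_layers * 12 * d_model ** 2
    # Every weight must be streamed from memory once per forward pass,
    # no matter how many tokens that pass covers.
    weight_bytes = weight_params * bytes_per_param
    # Dense matmuls cost roughly 2 FLOPs per weight per token processed.
    weight_flops = 2 * weight_params * n_tokens
    # Attention also reads the cached K and V vectors for every attended position.
    kv_bytes = n_layers * 2 * d_model * kv_len * bytes_per_param
    bytes_moved = weight_bytes + kv_bytes
    return weight_flops, bytes_moved, weight_flops / bytes_moved


if __name__ == "__main__":
    D_MODEL, N_LAYERS, PROMPT_LEN = 4096, 32, 1024  # hypothetical model / prompt size

    # Prefill: one read of the weights is amortized over every prompt token.
    prefill = phase_stats(D_MODEL, N_LAYERS, n_tokens=PROMPT_LEN, kv_len=PROMPT_LEN)
    # Decode: the weights and the growing KV cache are re-read for every new token.
    decode = phase_stats(D_MODEL, N_LAYERS, n_tokens=1, kv_len=PROMPT_LEN + 1)

    for name, (flops, nbytes, intensity) in [("prefill", prefill), ("decode", decode)]:
        print(f"{name:8s} {flops:.2e} FLOPs, {nbytes:.2e} bytes, {intensity:8.1f} FLOPs/byte")
```

Under these assumed numbers the prefill pass lands near a thousand FLOPs per byte while the decode step lands near one, which is the bandwidth-versus-compute distinction the question asks you to analyze.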
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An AI development team is optimizing their language model's inference speed. They observe that generating a long response token-by-token is significantly more time-consuming than processing the initial user prompt, even when the prompt is long. While the sequential nature of the generation is a factor, which of the following provides the most fundamental explanation for this high computational cost?
Analyzing Inference Performance Bottlenecks