Learn Before
Analyzing Inference Performance Bottlenecks
An ML engineering team is profiling their new large language model. They observe the following performance characteristics for a single request: the prefill phase, which processes a 512-token prompt, completes in 200 milliseconds, while the decoding phase, which generates a 128-token response, takes 1200 milliseconds. Based on these metrics, analyze the fundamental difference in computational bottlenecks between the two phases, and explain why generating the much shorter 128-token response takes six times longer than processing the 512-token prompt.
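A quick back-of-the-envelope calculation makes the contrast concrete. The sketch below is plain Python; the throughput figures come directly from the numbers in the scenario, while the 7B-parameter fp16 model size and ~1.5 TB/s memory bandwidth used for the decode lower bound are illustrative assumptions, not values given in the exercise.

```python
# Working through the arithmetic of the scenario above.
PREFILL_TOKENS = 512
PREFILL_MS = 200.0
DECODE_TOKENS = 128
DECODE_MS = 1200.0

# Prefill: all 512 prompt tokens are processed in parallel in large matrix
# multiplications, so throughput is high and the phase is compute-bound.
prefill_tokens_per_s = PREFILL_TOKENS / (PREFILL_MS / 1000.0)   # ~2560 tokens/s

# Decode: tokens are generated one at a time; each step must read the full
# set of model weights plus the growing KV cache from memory to produce a
# single token, so the phase is memory-bandwidth-bound.
decode_tokens_per_s = DECODE_TOKENS / (DECODE_MS / 1000.0)      # ~107 tokens/s
ms_per_decoded_token = DECODE_MS / DECODE_TOKENS                # ~9.4 ms/token

# Illustrative (assumed) model/hardware figures, NOT from the exercise:
# they show why ~9 ms per token is plausible for a memory-bound decode step.
WEIGHT_BYTES = 7e9 * 2        # e.g. a 7B-parameter model in fp16 is ~14 GB
HBM_BANDWIDTH = 1.5e12        # e.g. ~1.5 TB/s of GPU memory bandwidth
min_ms_per_token = WEIGHT_BYTES / HBM_BANDWIDTH * 1000.0        # ~9.3 ms

print(f"prefill throughput: {prefill_tokens_per_s:.0f} tokens/s")
print(f"decode throughput:  {decode_tokens_per_s:.0f} tokens/s")
print(f"decode latency:     {ms_per_decoded_token:.1f} ms per token")
print(f"throughput ratio (prefill/decode): "
      f"{prefill_tokens_per_s / decode_tokens_per_s:.0f}x")
print(f"lower bound from weight reads alone: {min_ms_per_token:.1f} ms per token")
```

Running it prints roughly 2560 tokens/s for prefill versus roughly 107 tokens/s for decode, a ~24x gap in per-token throughput, with the assumed weight-read lower bound landing very close to the observed per-token decode latency.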
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An AI development team is optimizing their language model's inference speed. They observe that generating a long response token-by-token is significantly more time-consuming than processing the initial user prompt, even when the prompt is long. While the sequential nature of the generation is a factor, which of the following provides the most fundamental explanation for this high computational cost?
Analyzing Inference Performance Bottlenecks
Deconstructing the High Cost of Autoregressive Decoding