Learn Before
Computational Cost Comparison: Decoding vs. Prefilling
In most inference scenarios, the decoding phase of a Transformer model incurs a higher computational cost than the prefilling phase. This expense is not merely a result of sequential, token-by-token generation and the repeated updates to the KV cache: each decoding step also performs small, memory-bandwidth-bound operations that underutilize the hardware, and decoding strategies that explore multiple candidate paths (such as beam search) multiply the work further.
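As a rough illustration, here is a minimal NumPy sketch (toy, made-up dimensions, a single attention head, and no causal mask; none of this comes from a real model) that times the same number of tokens going through one parallel prefill pass versus token-by-token decoding with a growing KV cache:

```python
import time
import numpy as np

# Illustrative toy sizes; these are assumptions, not real model settings.
D = 512           # hidden size
PROMPT_LEN = 200  # tokens processed in the prefilling phase
GEN_LEN = 200     # tokens produced in the decoding phase

rng = np.random.default_rng(0)
Wq = rng.standard_normal((D, D)) / np.sqrt(D)
Wk = rng.standard_normal((D, D)) / np.sqrt(D)
Wv = rng.standard_normal((D, D)) / np.sqrt(D)

def attention(q, k, v):
    # Scaled dot-product attention (causal mask omitted for brevity).
    scores = q @ k.T / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# --- Prefilling: the whole prompt goes through in one parallel pass. ---
prompt = rng.standard_normal((PROMPT_LEN, D))
t0 = time.perf_counter()
K = prompt @ Wk  # keys/values for all prompt tokens at once
V = prompt @ Wv
Q = prompt @ Wq
_ = attention(Q, K, V)
prefill_s = time.perf_counter() - t0

# --- Decoding: one token per step, appending to the KV cache each time. ---
t0 = time.perf_counter()
k_cache, v_cache = K.copy(), V.copy()
x = rng.standard_normal((1, D))
for _ in range(GEN_LEN):
    k_cache = np.vstack([k_cache, x @ Wk])  # repeated KV-cache update
    v_cache = np.vstack([v_cache, x @ Wv])
    out = attention(x @ Wq, k_cache, v_cache)
    x = out  # feed the step's output back in as the next "token"
decode_s = time.perf_counter() - t0

print(f"prefill  ({PROMPT_LEN} tokens, 1 pass):   {prefill_s * 1e3:.1f} ms")
print(f"decoding ({GEN_LEN} tokens, {GEN_LEN} passes): {decode_s * 1e3:.1f} ms")
```

On typical hardware the decoding loop is markedly slower even though its total arithmetic is comparable: each step launches small, memory-bound operations and copies the cache (np.vstack reallocates; production systems preallocate the cache instead), whereas prefilling amortizes the same work into a few large matrix multiplications.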
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Prefilling Phase in Transformer Inference
Decoding Phase in Transformer Inference
Analysis of KV Cache Utilization in Autoregressive Generation
In an autoregressive Transformer model, generating a sequence in response to an input prompt involves two distinct phases from the perspective of the Key-Value (KV) cache. Which option correctly distinguishes the computational activities of these two phases?
An autoregressive language model receives an input prompt and generates a response. From the perspective of how it uses its internal memory for past context (the Key-Value cache), arrange the following high-level stages of the generation process in the correct chronological order.
Learn After
Increased Complexity and Cost from Exploring Multiple Decoding Paths
Inference Performance Bottleneck Analysis
Analysis of Computational Costs in Transformer Inference
Factors Contributing to High Decoding Cost
An engineer observes that generating a 200-token response from a large language model takes significantly more time than processing the initial 200-token input prompt. Which of the following statements provides the most accurate technical explanation for this performance difference?