Inference Performance Bottleneck Analysis
Based on the principles of autoregressive generation in large language models, analyze the performance observation described in the case study. Identify the primary computational bottleneck and explain the core reason for its disproportionately high resource demand.
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Increased Complexity and Cost from Exploring Multiple Decoding Paths
Analysis of Computational Costs in Transformer Inference
Factors Contributing to High Decoding Cost
An engineer observes that generating a 200-token response from a large language model takes significantly more time than processing the initial 200-token input prompt. Which of the following statements provides the most accurate technical explanation for this performance difference?
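The asymmetry in the scenario can be illustrated with a toy cost model (a sketch with assumed unit costs, not a real benchmark): the 200-token prompt is processed in a single parallel forward pass (prefill), while the 200 generated tokens require 200 sequential forward passes (decode), each attending over an ever-growing context via the KV cache.

```python
def prefill_cost(prompt_len: int) -> int:
    # Prefill: attention over the prompt is O(n^2) work, but all tokens
    # are processed together in ONE sequential step on the hardware.
    return prompt_len * prompt_len

def generation_cost(prompt_len: int, gen_len: int) -> int:
    # Decode: tokens are produced one at a time; step t attends over the
    # prompt plus the t tokens generated so far (cached keys/values).
    # This is gen_len SEQUENTIAL steps that cannot be parallelized.
    return sum(prompt_len + t for t in range(gen_len))

# 200-token prompt vs. 200-token generation (arbitrary work units):
print(prefill_cost(200))          # 40000 units of work, 1 sequential step
print(generation_cost(200, 200))  # 59900 units of work, 200 sequential steps
```

The total work is of the same order, but prefill amortizes it across one parallel pass, whereas decoding serializes it into hundreds of small, memory-bandwidth-bound steps, which is why generating the response takes far longer than ingesting the prompt.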