Learn Before
Computational Bottleneck in Token Generation
During text generation with a Transformer model, the initial processing of the input prompt (the prefill phase) is typically compute-bound: it is limited by the speed of parallel computation. In contrast, the subsequent token-by-token generation (the decoding phase) is limited by a different factor. Explain why this second phase is considered a 'memory-bound' operation, and how the length of the generated text impacts its performance.
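One way to see why decoding becomes memory-bound is to estimate how much data must be read from memory for each new token. The sketch below (with hypothetical model dimensions, not taken from any specific model) computes the size of the KV cache that attention must re-read at every decoding step; because the cache grows linearly with the generated length, per-token memory traffic, and hence latency, grows with it.

```python
# Sketch with assumed, hypothetical model dimensions: per-token decoding
# cost is dominated by reading the model weights plus the growing KV cache.

def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128,
                   bytes_per_elem=2):
    """Bytes of key/value cache read per decoding step at a given context length.

    Two tensors (K and V) per layer, each of shape
    (seq_len, n_heads, head_dim), stored in 16-bit precision by default.
    """
    return 2 * n_layers * seq_len * n_heads * head_dim * bytes_per_elem

for t in (512, 2048, 8192):
    gb = kv_cache_bytes(t) / 1e9
    print(f"context {t:>5} tokens -> ~{gb:.2f} GB of KV cache read per new token")
```

The arithmetic done per token (one matrix-vector product per weight matrix) stays roughly constant, while this memory traffic scales with context length, which is why profilers see per-token latency climb as a long summary grows.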
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A developer is profiling a Transformer-based language model during the generation of a very long text summary. They notice that the latency to produce each new token is not constant; instead, it steadily increases as the summary grows in length. What is the primary reason for this observed slowdown?
Optimizing Chatbot Latency
Computational Bottleneck in Token Generation