Learn Before
Self-Attention as a Source of Inference Difficulty in Transformers
The self-attention mechanism, a core component of Transformer models, is a major source of inference difficulty: its attention computation scales quadratically with sequence length, and the key-value state it must retain grows with every processed token, driving both latency and memory pressure during generation.
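For intuition, here is a minimal NumPy sketch of single-head scaled dot-product attention; the (seq_len × seq_len) score matrix it materializes is the quadratic term that dominates cost for long inputs. The function name, head dimension, and sequence lengths below are illustrative assumptions, not taken from this card.

```python
import numpy as np

def naive_self_attention(q, k, v):
    """Single-head scaled dot-product attention over a whole sequence.

    q, k, v have shape (seq_len, d_head). The intermediate score matrix
    has shape (seq_len, seq_len), so both the arithmetic and the memory
    for this step grow quadratically with sequence length.
    """
    d_head = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d_head)            # (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # (seq_len, d_head)

# The score matrix alone illustrates the scaling: doubling the input
# length quadruples its size (float32 = 4 bytes per entry).
for seq_len in (1_024, 4_096, 16_384):
    score_mib = seq_len * seq_len * 4 / 2**20
    print(f"seq_len={seq_len:>6}: score matrix ~ {score_mib:,.0f} MiB per head")
```

Multiplying the per-head figure by the number of heads and layers shows why long inputs can exhaust memory even at batch size one, which is the failure mode described in the question below.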
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Learn After
A team is deploying a large language model to generate chapter-length summaries of scientific papers. They observe that the time required to generate a summary increases dramatically with the length of the input paper, and the process often fails due to 'out of memory' errors on their hardware, even when processing one paper at a time. Which component of the model's architecture is the most direct cause of this specific performance scaling issue?
Computational Bottlenecks in Autoregressive Generation
Diagnosing Performance Bottlenecks in Autoregressive Generation