Diagnosing Performance Bottlenecks in Autoregressive Generation
A development team is using a large, pre-trained language model for a real-time, multi-turn conversational agent. They observe that while the initial response to a user's first message is fast, the time it takes to generate each subsequent response in the same conversation increases progressively. System monitoring reveals that the memory allocated for the ongoing conversation grows linearly with the length of the conversation history. The team has confirmed this is not a network or server load issue. Based on the typical step-by-step (autoregressive) generation process, what specific data structure associated with the self-attention mechanism is the most likely cause of both the increasing latency and growing memory footprint? Explain the connection.
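As background for the step-by-step generation process the question refers to, the following is a minimal, illustrative sketch (dimensions and names are invented for this example, not taken from any particular model) of an autoregressive decode loop in which per-token attention state is appended at every step and never discarded:

```python
import numpy as np

D_HEAD = 64   # per-head state dimension (illustrative)
N_HEADS = 8   # number of attention heads (illustrative)

def decode_step(cache, new_token_state):
    """One autoregressive step: attend over everything generated so far."""
    # Per-token attention state is appended, never discarded, so the
    # cache grows by exactly one entry per generated token.
    cache.append(new_token_state)
    # The work of attending scales with the current cache length.
    context = np.stack(cache)  # shape: (seq_len, N_HEADS, D_HEAD)
    return context.nbytes, len(cache)

cache = []
for step in range(1, 6):
    token_state = np.zeros((N_HEADS, D_HEAD), dtype=np.float32)
    nbytes, seq_len = decode_step(cache, token_state)
    print(f"step {step}: entries={seq_len}, bytes={nbytes}")
```

Running the loop shows both quantities growing linearly with the number of generated tokens, mirroring the symptoms described above: each step must attend over a strictly larger history, and the retained per-token state consumes proportionally more memory.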
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A team is deploying a large language model to generate chapter-length summaries of scientific papers. They observe that the time required to generate a summary increases dramatically with the length of the input paper, and the process often fails with out-of-memory errors on their hardware, even when processing one paper at a time. Which component of the model's architecture is the most direct cause of this specific performance-scaling issue?
Computational Bottlenecks in Autoregressive Generation