Case Study

Diagnosing Performance Bottlenecks in Autoregressive Generation

A development team is using a large, pre-trained language model for a real-time, multi-turn conversational agent. They observe that while the initial response to a user's first message is fast, the time it takes to generate each subsequent response in the same conversation increases progressively. System monitoring reveals that the memory allocated for the ongoing conversation grows linearly with the length of the conversation history. The team has confirmed this is not a network or server load issue. Based on the typical step-by-step (autoregressive) generation process, what specific data structure associated with the self-attention mechanism is the most likely cause of both the increasing latency and growing memory footprint? Explain the connection.
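As background for reasoning about the question, the memory growth described above can be sketched with a toy calculation. This is a hypothetical minimal model (the dimensions `n_layers`, `n_heads`, `head_dim` are assumed, not taken from any real system): in autoregressive generation, each decoded token appends one key and one value vector per attention head per layer to a cache, so cached state scales linearly with the number of tokens kept in the conversation history.

```python
# Toy sketch (assumed dimensions, not a real LLM) of how per-token
# key/value state accumulates linearly with sequence length.

n_layers, n_heads, head_dim = 4, 8, 64  # hypothetical model dimensions

def kv_cache_bytes(seq_len, dtype_bytes=2):
    # Per token, each layer stores one key vector and one value vector
    # per head (the factor of 2 below), each of size head_dim,
    # at dtype_bytes per element (2 bytes for fp16).
    return 2 * n_layers * n_heads * head_dim * seq_len * dtype_bytes

# Each conversational turn is appended to the history, so the cached
# state for the conversation keeps growing turn after turn.
for seq_len in (128, 512, 2048):
    print(f"{seq_len} tokens -> {kv_cache_bytes(seq_len)} bytes cached")
```

Under these assumptions, doubling the history doubles the cached bytes, matching the linear memory growth the team observed; the per-step attention computation also reads this entire cache, which accounts for the rising latency.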


Updated 2025-10-10


Tags

Ch.5 Inference - Foundations of Large Language Models