Learn Before
  • Diagram of the Decoding Phase

Decoding Phase as a Memory-Bound Process

The decoding phase in Transformer models is memory-bound because each generation step must read the entire Key-Value (KV) cache from memory to compute attention over all previous tokens. Since the cache grows by one entry per generated token, the memory traffic per decoding step increases linearly with the output length, so per-token latency rises as the sequence grows.
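The linear growth of memory traffic can be sketched with a small back-of-the-envelope calculation. The model dimensions below (layer count, KV heads, head dimension, fp16 storage) are illustrative assumptions, not taken from any specific model:

```python
# Sketch: why decoding is memory-bound. At every step, attention reads
# the whole KV cache, whose size grows linearly with sequence length.
# All model dimensions below are hypothetical, for illustration only.

BYTES_PER_VALUE = 2      # fp16 storage (assumed)
NUM_LAYERS = 32          # assumed layer count
NUM_KV_HEADS = 32        # assumed number of KV heads
HEAD_DIM = 128           # assumed per-head dimension

def kv_cache_bytes(seq_len: int) -> int:
    """Bytes of K and V cached for a sequence of seq_len tokens."""
    # Factor of 2: one K vector and one V vector per token, per layer.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return seq_len * per_token

# Memory read per decoding step at different positions in the output:
for step in (1, 512, 4096):
    mib = kv_cache_bytes(step) / 2**20
    print(f"step {step:>5}: ~{mib:.1f} MiB read from the KV cache")
```

Under these assumptions the cache read grows from about 0.5 MiB at the first step to roughly 2 GiB by token 4096, while the arithmetic per step stays nearly constant, which is why memory bandwidth, not compute, dominates decoding latency.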

Tags

  • Ch.5 Inference - Foundations of Large Language Models
  • Foundations of Large Language Models
  • Foundations of Large Language Models Course
  • Computing Sciences

Related
  • Diagram of the N-th Step in Transformer Decoding

  • A large language model has processed an initial prompt and has just generated the fifth token of its output. As it prepares to generate the sixth token, which of the following statements most accurately describes the function of the self-attention mechanism in this specific step?

  • A large language model is generating a response one token at a time after processing the initial prompt. Arrange the following actions in the correct sequence to describe how a single new token is generated.

  • Q, K, and V Composition in Transformer Decoding

  • Analyzing a Flawed Decoding Step

Learn After
  • A developer is profiling a Transformer-based language model during the generation of a very long text summary. They notice that the latency to produce each new token is not constant; instead, it steadily increases as the summary grows in length. What is the primary reason for this observed slowdown?

  • Optimizing Chatbot Latency

  • Computational Bottleneck in Token Generation