
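The decoding phase in Transformer models is considered a memory-bound operation: every decoding step must read the entire Key-Value (KV) cache from memory while performing relatively little computation per byte read. This memory-bandwidth bottleneck is exacerbated as the output sequence grows. With standard attention, the KV cache gains one entry per generated token, so the cost of each decoding step increases roughly linearly with the current sequence length.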
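To make the growth concrete, here is a minimal back-of-the-envelope sketch. It assumes standard multi-head attention with an unquantized 16-bit cache. The model shape (32 layers, 32 KV heads, head dimension 128) and the 1 TB/s memory bandwidth are hypothetical values chosen for illustration, not measurements of any particular model or device. The estimate also ignores the model weights, which add a roughly constant amount of memory traffic to every step.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of cached keys and values for one sequence of seq_len tokens."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def kv_read_time_s(seq_len, bandwidth_bytes_per_s, **model_shape):
    """Lower bound on the time one decoding step spends reading the KV cache."""
    return kv_cache_bytes(seq_len, **model_shape) / bandwidth_bytes_per_s


# Hypothetical model shape and memory bandwidth (assumptions for illustration).
MODEL_SHAPE = dict(n_layers=32, n_kv_heads=32, head_dim=128)
BANDWIDTH = 1e12  # bytes per second

if __name__ == "__main__":
    for seq_len in (1_000, 10_000, 100_000):
        size_mb = kv_cache_bytes(seq_len, **MODEL_SHAPE) / 1e6
        read_ms = kv_read_time_s(seq_len, BANDWIDTH, **MODEL_SHAPE) * 1e3
        print(f"{seq_len:>7} tokens: {size_mb:10.1f} MB cache, "
              f"{read_ms:8.3f} ms per step just to read it")
```

Under these assumptions the cache read per step scales linearly with the number of tokens already processed, which is why per-token latency creeps upward during long generations even though the arithmetic per step stays small.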

Decoding Phase as a Memory-Bound Process

A developer is profiling a Transformer-based language model during the generation of a very long text summary. They notice that the latency to produce each new token is not constant; instead, it steadily increases as the summary grows in length. What is the primary reason for this observed slowdown?

Based on your understanding of the computational characteristics of the token generation process, which hardware upgrade option is more likely to solve the team's specific problem? Justify your answer by explaining the primary bottleneck during this phase.

Optimizing Chatbot Latency

During text generation with a Transformer model, the initial processing of the input prompt is often limited by the speed of parallel computations. In contrast, the subsequent, token-by-token generation process is typically limited by a different factor. Explain why this second phase is considered a 'memory-bound' operation and how the size of the generated text impacts its performance.
