Based on the scenario described, which computational phase corresponds to the initial burst, and which corresponds to the subsequent sequential generation? Justify your answer by describing the fundamental processing difference between these two phases.

The prefilling and decoding phases of large language model inference differ along several dimensions. Prefilling establishes the initial context from the input sequence, while decoding generates the subsequent tokens. In prefilling, all input tokens are visible at once and are processed in parallel in a single forward pass, which builds the contextual representation and populates the key-value (KV) cache. Decoding, in contrast, is sequential: the model predicts one token at a time, reusing the cached key-value pairs rather than recomputing them. Consequently, prefilling is typically compute-bound, with a high computational cost concentrated in one pass, whereas decoding is typically memory-bound: each step performs relatively little computation but must reread the model weights and a KV cache that grows with the sequence.
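The compute-bound versus memory-bound contrast can be made concrete with a back-of-the-envelope estimate. The sketch below is only illustrative and rests on stated assumptions: a hypothetical 7-billion-parameter model, 2 bytes per parameter, the common approximation of about 2 FLOPs per parameter per token, and weights read once per forward pass. It ignores attention FLOPs and KV-cache traffic, and the function name `arithmetic_intensity` is made up for this example.

```python
def arithmetic_intensity(tokens_per_pass: int, n_params: int,
                         bytes_per_param: int = 2) -> float:
    """Rough FLOPs performed per byte of weights read in one forward pass.

    Assumptions (not from the source text): ~2 FLOPs per parameter per token,
    and the weights are streamed from memory once per pass.
    """
    flops = 2 * n_params * tokens_per_pass   # work grows with tokens in the pass
    bytes_read = n_params * bytes_per_param  # weight traffic does not
    return flops / bytes_read


N_PARAMS = 7_000_000_000  # hypothetical model size, chosen for illustration

# Prefill: all 2,000 prompt tokens go through the model in one parallel pass.
prefill = arithmetic_intensity(tokens_per_pass=2000, n_params=N_PARAMS)

# Decode: each step pushes a single new token through the model.
decode = arithmetic_intensity(tokens_per_pass=1, n_params=N_PARAMS)

print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.0f} FLOPs/byte")
```

Under these assumptions, prefill does thousands of times more arithmetic per byte moved than a decode step, which is why the former tends to be limited by compute throughput and the latter by memory bandwidth.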

Comparison of Prefilling and Decoding Phases

Analyzing Language Model Inference Performance

A user provides a large 2,000-token text to a generative language model and asks for a summary. Which statement best describes how the model initially handles this 2,000-token input *before* it starts generating the summary?
