Learn Before
Token Prediction within the Prefilling Phase
The prefilling phase processes the entire input sequence in a single parallel computation, building the KV cache for every prompt token at once. Because the logits at the final prompt position already define a next-token distribution, this pass also yields the probability distribution for the first output token. In certain scenarios, prefilling can even extend to predicting subsequent tokens, such as the second output token.
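To make this concrete, below is a minimal PyTorch sketch of prefilling with a toy single-layer attention model. All names and sizes here (w_q, lm_head, d_model, the prompt IDs) are illustrative assumptions rather than anything defined in the course; the point is only that one parallel pass both fills the KV cache and produces the distribution over the first output token.

```python
import torch

# Toy prefilling sketch: every prompt token is processed in one parallel pass.
# (Single layer, single head, made-up sizes -- illustrative only.)
torch.manual_seed(0)

vocab_size, d_model = 100, 16
prompt_ids = torch.tensor([[5, 42, 7, 99]])        # (batch=1, seq_len=4)

embed = torch.nn.Embedding(vocab_size, d_model)
w_q = torch.nn.Linear(d_model, d_model, bias=False)
w_k = torch.nn.Linear(d_model, d_model, bias=False)
w_v = torch.nn.Linear(d_model, d_model, bias=False)
lm_head = torch.nn.Linear(d_model, vocab_size, bias=False)

x = embed(prompt_ids)                               # (1, 4, d_model)

# One matrix multiply per projection covers all prompt positions at once --
# this is the parallelism that makes prefilling compute-bound.
q, k, v = w_q(x), w_k(x), w_v(x)

# The keys and values computed here are what the KV cache stores; the
# decoding phase reuses them instead of recomputing them at every step.
kv_cache = (k, v)

# Causal self-attention over the whole prompt in a single call.
attn_out = torch.nn.functional.scaled_dot_product_attention(
    q, k, v, is_causal=True
)

# The logits at the last prompt position define p(first output token | prompt).
first_token_logits = lm_head(attn_out[:, -1, :])
first_token_probs = torch.softmax(first_token_logits, dim=-1)
print(first_token_probs.shape)                      # torch.Size([1, 100])
```

A real model stacks many such layers and heads, but the shape of the computation is the same: the KV cache and the first-token distribution both fall out of one forward pass over the prompt.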
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Related
Self-Attention Formula for the Prefilling Phase
Prefilling as a Compute-Bound Process
When a large language model first processes a user's prompt, it can perform calculations for all tokens in the prompt simultaneously rather than one by one. What is the fundamental condition that makes this highly parallel approach possible during this initial stage?
LLM Inference Performance Analysis
Rationale for Parallelism in Initial Prompt Processing
Diagram of the Prefilling Phase
Learn After
When a large language model is given a long text prompt, the initial processing phase computes representations for all input tokens simultaneously. Given this highly parallel approach, what is the most direct and immediate outcome with respect to generating the first token of the model's response?
Post-Prefilling State Analysis
During the initial processing of an input sequence, the parallel computation phase is not strictly limited to generating the probability distribution for the single, first output token; in certain scenarios it can also extend to predicting subsequent tokens.