Multiple Choice

A large language model inference system is processing two user requests concurrently. Request 1 has a very long prompt that requires significant initial computation. Request 2 is already in the process of generating a response, producing one token at a time. The system's scheduler operates by breaking the initial computation for Request 1 into three smaller chunks. It processes the first chunk of Request 1, then generates one token for Request 2, then processes the second chunk of Request 1, then generates another token for Request 2, and so on. What is the primary advantage of this interleaved processing strategy?
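The schedule the question describes can be illustrated with a minimal sketch. This is not any particular serving framework's scheduler; the function names (`split_into_chunks`, `interleaved_schedule`), the chunk size, and the token counts are illustrative assumptions. The sketch only shows the ordering of work: one chunk of Request 1's prompt per step, alternated with one generated token for Request 2, so Request 2 keeps receiving tokens at a steady pace instead of waiting for Request 1's entire prompt to finish.

```python
from collections import deque

def split_into_chunks(prompt_tokens, chunk_size):
    """Split a long prompt into fixed-size chunks for incremental processing."""
    return deque(prompt_tokens[i:i + chunk_size]
                 for i in range(0, len(prompt_tokens), chunk_size))

def interleaved_schedule(prompt_chunks, tokens_to_generate):
    """Alternate one prompt chunk (Request 1) with one generated token (Request 2)."""
    steps = []
    generated = 0
    while prompt_chunks or generated < tokens_to_generate:
        if prompt_chunks:
            # Process the next chunk of Request 1's long prompt.
            steps.append(("request_1_prompt_chunk", len(prompt_chunks.popleft())))
        if generated < tokens_to_generate:
            # Generate one more token for Request 2, which is mid-response.
            steps.append(("request_2_generate_token", 1))
            generated += 1
    return steps

if __name__ == "__main__":
    # Request 1: a long prompt split into three chunks; Request 2: still generating.
    chunks = split_into_chunks(list(range(300)), chunk_size=100)
    for step in interleaved_schedule(chunks, tokens_to_generate=5):
        print(step)
```

Running the sketch prints the alternating step order, making the scheduling trade-off concrete: Request 2's token stream is never blocked behind the whole of Request 1's prompt computation.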

Updated 2025-09-26

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science