Multiple Choice

A large language model inference system is handling a mix of requests: many short, single-word generation tasks and a few long-input processing tasks. Initially, the system exhibits low overall throughput, with the short tasks experiencing significant delays. A modification is made to the system: instead of processing each long input in one large computational step, it is broken down and processed in a series of smaller, sequential steps. After this change, overall throughput increases and delays for short tasks are reduced. Which statement best analyzes why this modification was effective?
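The modification the question describes is commonly known as chunked prefill: splitting one long prompt-processing pass into several smaller steps so the scheduler can interleave short decode requests between them. The toy simulation below (a minimal sketch; the workload sizes, chunk size, and `simulate` helper are illustrative assumptions, not any real engine's scheduler) shows the effect on completion times.

```python
from collections import deque

def simulate(chunk_size):
    """Toy single-engine scheduler: one long prefill of 8 compute units,
    three short 1-unit tasks queued at t=0. Between prefill chunks,
    one waiting short task is served. Returns finish times per task."""
    long_remaining = 8
    shorts = deque(["s1", "s2", "s3"])
    finish_times = {}
    t = 0
    while long_remaining > 0 or shorts:
        if long_remaining > 0:
            step = min(chunk_size, long_remaining)  # process one prefill chunk
            long_remaining -= step
            t += step
            if long_remaining == 0:
                finish_times["long"] = t
        if shorts:
            # a short task slips in between chunks
            finish_times[shorts.popleft()] = t + 1
            t += 1
    return finish_times

print(simulate(chunk_size=8))  # monolithic prefill: {'long': 8, 's1': 9, 's2': 10, 's3': 11}
print(simulate(chunk_size=2))  # chunked prefill:    {'s1': 3, 's2': 6, 's3': 9, 'long': 11}
```

With chunking, the short tasks finish at steps 3, 6, and 9 instead of 9, 10, and 11, cutting their delays and lowering mean completion time, while the long task finishes slightly later. This is the tradeoff the question asks test-takers to analyze.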

Updated 2025-09-26

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science