Short Answer

Pipelined Engine Efficiency

An LLM inference system uses two separate engines in a pipeline: Engine 1 for processing initial prompts (prefilling) and Engine 2 for generating subsequent tokens (decoding). When a continuous stream of request batches arrives, explain precisely when Engine 1 can start processing a new batch (e.g., Batch B) in relation to the processing of the previous batch (Batch A). Why is this timing crucial for maximizing hardware utilization?
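To make the setup concrete, here is a minimal timeline sketch of such a two-engine pipeline. It is a toy model, not a description of any real serving system: the function names `pipeline_schedule` and `serial_makespan` are hypothetical, each batch is assumed to have a fixed prefill and decode duration, handoff between engines is assumed to be free, and finished prefills are assumed to wait in an unbounded queue if Engine 2 is still busy.

```python
def pipeline_schedule(prefill_times, decode_times):
    """Return (prefill_start, prefill_end, decode_start, decode_end) per batch.

    Assumption: Engine 1 handles one batch at a time and may begin the next
    batch the moment it finishes prefilling the current one. Engine 2 starts
    decoding a batch once that batch is prefilled AND Engine 2 is idle.
    """
    schedule = []
    engine1_free = 0  # time at which Engine 1 (prefill) becomes idle
    engine2_free = 0  # time at which Engine 2 (decode) becomes idle
    for prefill, decode in zip(prefill_times, decode_times):
        p_start = engine1_free
        p_end = p_start + prefill
        d_start = max(p_end, engine2_free)
        d_end = d_start + decode
        engine1_free = p_end
        engine2_free = d_end
        schedule.append((p_start, p_end, d_start, d_end))
    return schedule


def serial_makespan(prefill_times, decode_times):
    """Total time if each batch must finish decoding before the next prefill."""
    return sum(prefill_times) + sum(decode_times)


if __name__ == "__main__":
    prefill = [2, 2, 2]  # illustrative durations for batches A, B, C
    decode = [3, 3, 3]
    for name, row in zip("ABC", pipeline_schedule(prefill, decode)):
        print(name, row)
    print("pipelined makespan:", pipeline_schedule(prefill, decode)[-1][3])
    print("serial makespan:", serial_makespan(prefill, decode))
```

With these illustrative numbers, Batch B's prefill overlaps Batch A's decode, so both engines stay busy instead of one idling while the other works. The durations are made up; real prefill and decode costs vary with prompt length, output length, and batch size.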

Updated 2025-10-10

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science