Case Study

Analyzing Inference Performance Bottlenecks

An ML engineering team is profiling their new large language model. For a single request they observe the following: the prefill phase, which processes a 512-token prompt, completes in 200 milliseconds, while the decoding phase, which generates a 128-token response, takes 1,200 milliseconds. Based on these metrics, analyze the fundamental difference in computational bottlenecks between the two phases: what explains why generating a much shorter sequence takes six times longer than processing the longer prompt?
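A useful first step is to convert the raw timings into per-token throughput, since the wall-clock numbers alone hide how lopsided the two phases are. The sketch below is illustrative Python using only the figures stated in the scenario:

```python
# Convert the scenario's raw timings into per-token throughput.
prefill_tokens, prefill_ms = 512, 200   # prompt processing (prefill) phase
decode_tokens, decode_ms = 128, 1200    # token generation (decoding) phase

prefill_tps = prefill_tokens / (prefill_ms / 1000)  # 2560 tokens/s
decode_tps = decode_tokens / (decode_ms / 1000)     # ~106.7 tokens/s

print(f"Prefill throughput: {prefill_tps:.0f} tokens/s")
print(f"Decode throughput:  {decode_tps:.1f} tokens/s")
print(f"Per-token gap:      {prefill_tps / decode_tps:.0f}x")  # ~24x
```

The roughly 24x per-token throughput gap, not the 6x wall-clock ratio, is the quantity your analysis should account for when contrasting how the two phases use the hardware.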
