Case Study

Diagnosing a Scalability Bottleneck in a Training Cluster

A team is training a large model on a cluster of 8 machines, each with a powerful processing unit. They observe that the training speed does not increase linearly as they add more machines. Using monitoring tools, they notice that at the end of each training step, 2 of the 8 machines consistently finish their assigned computation and data processing tasks significantly later than the other 6. The overall training step can only complete after these 2 slower machines are finished, leaving the other 6 machines idle for a noticeable period. Based on this specific observation, which of the following factors is the most critical bottleneck preventing the system from scaling effectively? Justify your choice.

0

1

Updated 2025-09-28

Contributors are:

Who are from:

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Evaluation in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science