1Cademy - Diagnosing a Scalability Bottleneck in a Training Cluster

Learn Before

Additional Scalability Factors in Distributed Training

Case Study

A team is training a large model on a cluster of 8 machines, each with a powerful processing unit. They observe that training speed does not increase linearly as they add more machines. Monitoring tools show that in each training step, 2 of the 8 machines consistently finish their assigned computation and data processing tasks significantly later than the other 6. The training step can complete only after these 2 slower machines have finished, so the other 6 machines sit idle for a noticeable period.

Based on this specific observation, which of the following factors is the most critical bottleneck preventing the system from scaling effectively? Justify your choice.
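The observation above can be made concrete with a small calculation. The following is a minimal sketch, not a measurement from the case: the per-machine times (1.0 s for the 6 faster machines, 1.5 s for the 2 slower ones) and the helper name `step_stats` are hypothetical, chosen only to show how a synchronous step is gated by its slowest participant.

```python
def step_stats(compute_times):
    """Summarize one synchronous training step.

    compute_times: per-machine time (seconds) to finish its share of the step.
    Returns (step_time, idle_times, utilization).
    """
    # The step ends only when the slowest machine finishes.
    step_time = max(compute_times)
    # Every faster machine waits for the difference.
    idle_times = [step_time - t for t in compute_times]
    # Fraction of total machine-time spent doing useful work.
    utilization = sum(compute_times) / (step_time * len(compute_times))
    return step_time, idle_times, utilization


# Hypothetical timings: 6 machines take 1.0 s, 2 machines take 1.5 s.
times = [1.0] * 6 + [1.5] * 2
step_time, idle_times, utilization = step_stats(times)

print(f"step time:   {step_time:.2f} s")
print(f"idle times:  {idle_times}")
print(f"utilization: {utilization:.0%}")
```

Under these assumed numbers, each of the 6 faster machines waits 0.5 s per step and the cluster does useful work for only 75% of its available machine-time, even though most machines finish on schedule.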

Updated 2025-09-28
