Diagnosing a Scalability Bottleneck in a Training Cluster
A team is training a large model on a cluster of 8 machines, each with a powerful processing unit. They observe that the training speed does not increase linearly as they add more machines. Using monitoring tools, they notice that at the end of each training step, 2 of the 8 machines consistently finish their assigned computation and data processing tasks significantly later than the other 6. The overall training step can only complete after these 2 slower machines are finished, leaving the other 6 machines idle for a noticeable period. Based on this specific observation, which of the following factors is the most critical bottleneck preventing the system from scaling effectively? Justify your choice.
0
1
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Diagnosing a Scalability Bottleneck in a Training Cluster
A distributed training system for a large model uses an efficient parallelism strategy across multiple nodes. However, monitoring tools reveal that the GPUs are consistently operating at only 40% utilization, significantly hindering overall training speed. Which of the following adjustments is most likely to address this specific performance bottleneck?
Analyzing Scalability Trade-offs in Distributed Training