Learn Before
Diagnosing Training Inefficiency
A machine learning team is training a large model partitioned across four accelerators, where each accelerator holds a different sequential segment of the model. They notice that their monitoring tools show a 'bubble' of inactivity that propagates through the accelerators; only one device is active at any given time during a forward or backward pass, leading to poor overall hardware utilization. What specific type of parallelism is designed to solve this exact problem, and how does it achieve better hardware utilization?
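The scenario above describes the "pipeline bubble" of naive model parallelism, which pipeline parallelism reduces by splitting each batch into micro-batches so stages work concurrently. A minimal sketch of the utilization arithmetic, assuming a GPipe-style schedule (the function name and slot model are illustrative, not from the question):

```python
def pipeline_utilization(num_stages: int, num_microbatches: int) -> float:
    """Fraction of time each stage does useful work in a GPipe-style
    pipeline schedule.

    The schedule takes (num_microbatches + num_stages - 1) time slots to
    drain; each stage is busy for num_microbatches of those slots, and the
    remaining (num_stages - 1) slots are the "bubble" of idle time.
    """
    total_slots = num_microbatches + num_stages - 1
    return num_microbatches / total_slots

# Naive model parallelism is the 1-micro-batch case: with 4 stages,
# each device is busy only 1/4 of the time -> utilization 0.25.
naive = pipeline_utilization(num_stages=4, num_microbatches=1)

# Splitting the batch into 8 micro-batches shrinks the bubble:
# utilization rises to 8 / (8 + 3) ~ 0.73.
pipelined = pipeline_utilization(num_stages=4, num_microbatches=8)
```

Driving the micro-batch count well above the stage count pushes utilization toward 1, which is exactly how pipeline parallelism fills the bubble the monitoring tools revealed.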
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Data Parallelism
Model Parallelism
Pipeline Parallelism
A research team is developing a novel language model with several trillion parameters. During the initial training setup, they discover that the model is too large to fit into the memory of a single available accelerator (e.g., a GPU). Which parallelism strategy is specifically designed to address this fundamental constraint?
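The memory constraint in this question can be made concrete with a back-of-the-envelope calculation. This sketch assumes fp16 weights (2 bytes per parameter) and counts only the weights themselves, ignoring optimizer state and activations, which add several times more in practice; the function name is illustrative:

```python
def weights_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory in GiB needed just to hold the model weights.

    Assumes fp16/bf16 storage by default (2 bytes per parameter);
    optimizer state and activations are not included.
    """
    return num_params * bytes_per_param / 2**30

# A 1-trillion-parameter model in fp16 needs roughly
# 1e12 * 2 bytes ~ 1863 GiB for the weights alone -- far beyond any
# single accelerator's memory, so the weights must be partitioned
# across devices, which is what model parallelism does.
one_trillion_fp16 = weights_memory_gib(1e12)
```

Even before gradients and optimizer state are counted, the weights alone exceed single-device memory by more than an order of magnitude, which is why model parallelism (partitioning the model itself across accelerators) is the strategy this question points to.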
Match each parallelism strategy with the description that best defines its core mechanism for distributing the training workload.