Short Answer

Diagnosing Training Inefficiency

A machine learning team is training a large model partitioned across four accelerators, where each accelerator holds a different sequential segment of the model. They notice that their monitoring tools show a 'bubble' of inactivity that propagates through the accelerators; only one device is active at any given time during a forward or backward pass, leading to poor overall hardware utilization. What specific type of parallelism is designed to solve this exact problem, and how does it achieve better hardware utilization?
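The idle time described in this scenario can be made concrete with a small simulation. The sketch below is only an illustration. It assumes every segment takes one equal time step for its forward pass and one for its backward pass, and the names `naive_schedule` and `utilization` are hypothetical, not taken from any library or monitoring tool.

```python
def naive_schedule(num_devices):
    """Return the index of the single active device at each time step.

    Assumption: one batch flows forward through the segments in order,
    then backward in reverse order, one time step per segment per pass.
    """
    forward = list(range(num_devices))
    backward = list(reversed(range(num_devices)))
    return forward + backward


def utilization(timeline, num_devices):
    """Fraction of (device, time step) slots in which a device is busy."""
    busy_slots = len(timeline)                  # exactly one busy device per step
    total_slots = len(timeline) * num_devices   # every device at every step
    return busy_slots / total_slots


timeline = naive_schedule(4)
for step, device in enumerate(timeline):
    # Mark the active device with '#' and idle devices with '.'
    row = "".join("#" if d == device else "." for d in range(4))
    print(f"t={step}: {row}")

print("utilization:", utilization(timeline, 4))
```

Under these assumptions each of the four accelerators is busy for only 2 of the 8 time steps, so overall utilization is 25%. That idle capacity is what the question asks you to recover.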

Updated 2025-10-10

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science