Learn Before
Numerical Computation Issues in Distributed Training
To ensure accurate results and reliable convergence in distributed training, system design must account for potential numerical computation problems. These challenges are especially pronounced at large scale or when using low-precision arithmetic. Key issues include the non-associativity of floating-point addition, which can affect gradient accumulation; the risk of overflow and underflow errors; and computational inconsistencies that can arise between different hardware devices.
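As a minimal sketch of the first two issues, the following NumPy snippet (with contrived, illustrative values) shows that a floating-point sum depends on evaluation order, and that values which are unremarkable in float32 overflow or underflow in float16:

    import numpy as np

    # Non-associativity: the same three float32 values, summed in two
    # orders, give different results because every intermediate sum is
    # rounded. The values are contrived to make the effect visible.
    a = np.float32(1.0)
    b = np.float32(2**-24)  # half of float32's machine epsilon
    c = np.float32(2**-24)
    print((a + b) + c)  # 1.0       -- b and c are each rounded away
    print(a + (b + c))  # 1.0000001 -- b + c together survive the rounding

    # Overflow and underflow: float16 spans roughly 6e-8 to 65504, so
    # values that are ordinary in float32 degrade when cast down.
    print(np.float16(1e5))   # inf (overflow)
    print(np.float16(1e-8))  # 0.0 (underflow)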
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Communication Cost in Distributed Systems
Synchronization Costs in Distributed Systems
Fault Tolerance in Distributed Systems
Additional Scalability Factors in Distributed Training
Numerical Computation Issues in Distributed Training
A research team is training a large model on 128 processing units, and the process takes 10 days. To accelerate the training, they double the number of processing units to 256. However, the new training time is 7 days, not the expected 5 days. Which of the following statements best analyzes this outcome? (A worked calculation follows this list.)
Scaling Challenges in LLM Training
Match each distributed training problem scenario with the primary underlying factor that causes it.
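For the scaling question above, a short worked calculation shows why doubling devices does not halve the time. This is a sketch under the assumed, illustrative model that total time splits into a fixed, non-parallelizable part plus a part that divides evenly across devices; the function name is hypothetical:

    # Fit time(N) = serial + parallel / N to the two observations
    # from the question: 10 days on 128 units, 7 days on 256 units.
    def fit_overhead(n1, t1, n2, t2):
        parallel = (t1 - t2) / (1 / n1 - 1 / n2)
        serial = t1 - parallel / n1
        return serial, parallel

    serial, parallel = fit_overhead(128, 10, 256, 7)
    print(serial)                   # 4.0 days of fixed cost (communication, sync)
    print(parallel)                 # 768.0 device-days of parallelizable work
    print(serial + parallel / 512)  # 5.5 days predicted on 512 units

Under this model, roughly 4 of the original 10 days are overhead that adding devices cannot remove, which is why the speedup falls short of linear.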
Learn After
Low-Precision Arithmetic Challenges in Distributed Training
Impact of Floating-Point Non-Associativity in Gradient Accumulation
Diagnosing Instability in Large-Scale Model Training
A team training a large model on a distributed system notices a peculiar issue. When they perform a gradient accumulation step by summing gradients from all worker nodes, the final aggregated gradient value on the parameter server differs slightly depending on the order in which the gradients arrive and are summed. The team has verified that all worker nodes are using identical hardware and are calculating their individual gradients correctly. Which of the following best explains this phenomenon? (A small simulation of this effect follows this list.)
Investigating Training Instability with Mixed-Precision Hardware
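The gradient-ordering scenario above can be reproduced in a few lines. This is an illustrative simulation, not the team's actual setup: the worker count, gradient values, and names are made up. It shows that summing the same float32 values in different arrival orders can yield different totals:

    import random
    import numpy as np

    # Simulated per-worker gradients for one parameter (float32).
    rng = np.random.default_rng(0)
    worker_grads = (rng.standard_normal(1024) * 1e3).astype(np.float32)

    def aggregate(order):
        total = np.float32(0.0)
        for i in order:
            total += worker_grads[i]  # each step rounds, so order matters
        return float(total)

    order = list(range(len(worker_grads)))
    totals = set()
    for _ in range(5):
        random.shuffle(order)       # a different arrival order each time
        totals.add(aggregate(order))

    print(totals)  # typically several distinct float32 results

Every order is a mathematically identical sum; the spread comes purely from rounding of intermediate results, matching the explanation that floating-point addition is not associative.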