Diagnosing Instability in Large-Scale Model Training
A machine learning team is training a very large model on a cluster of hundreds of processing units. To improve efficiency, they perform calculations using a low-precision 16-bit number format. The training process involves calculating updates on small data batches on each unit and then summing these updates together. The team observes two problems: 1) The model's performance metrics begin to diverge slightly across different groups of processing units, even when using identical configurations and data. 2) The training process occasionally halts because key values become invalid (e.g., 'Not a Number'). Analyze this scenario and identify the two most likely numerical computation issues causing these problems. For each issue, explain how it leads to one of the observed problems.
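As a starting point for the analysis, the following minimal sketch emulates 16-bit floating-point accumulation using only Python's standard library. It is an illustration under stated assumptions, not the team's actual setup: the helper names `to_fp16` and `fp16_sum` are hypothetical, the values are invented for the demonstration, and IEEE 754 half precision (the `struct` module's `e` format) is assumed as the 16-bit format. It shows two effects that may be relevant to the scenario: the result of a low-precision sum can depend on the order of the additions, and a sum that exceeds the representable range becomes infinite, after which an operation such as inf - inf yields NaN.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses finite values beyond the fp16 range;
        # we assume hardware-style behavior and saturate to +/-inf instead.
        return math.copysign(math.inf, x)


def fp16_sum(values):
    """Sum left to right, rounding to fp16 after every addition."""
    total = 0.0
    for v in values:
        total = to_fp16(total + to_fp16(v))
    return total


# Hypothetical per-unit updates: one large value and several small ones.
updates = [2048.0, 0.5, 0.5, 0.5, 0.5]

# Near 2048 the spacing between adjacent fp16 values is 2, so adding 0.5
# to 2048 rounds back to 2048 and the small updates are lost.
forward = fp16_sum(updates)

# Summing the small values first lets them accumulate to 2.0 before the
# large value is added, so they survive.
backward = fp16_sum(list(reversed(updates)))

# The largest finite fp16 value is 65504, so this sum overflows to inf.
overflowed = fp16_sum([40000.0, 40000.0])

# Any later inf - inf (or 0 * inf) turns the value into NaN.
poisoned = to_fp16(overflowed - overflowed)

if __name__ == "__main__":
    print("forward order :", forward)
    print("reverse order :", backward)
    print("overflowed sum:", overflowed)
    print("inf - inf     :", poisoned)
```

The same inputs produce different totals depending only on the order of addition, which is one plausible route to metrics drifting apart between groups that reduce their updates in different orders. The overflow case shows how a single out-of-range sum can propagate into invalid values.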
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Low-Precision Arithmetic Challenges in Distributed Training
Impact of Floating-Point Non-Associativity in Gradient Accumulation
Diagnosing Instability in Large-Scale Model Training
A team training a large model on a distributed system notices a peculiar issue. When they perform a gradient accumulation step by summing gradients from all worker nodes, the final aggregated gradient value on the parameter server slightly differs depending on the order in which the gradients arrive and are summed. The team has verified that all worker nodes are using identical hardware and are calculating their individual gradients correctly. Which of the following best explains this phenomenon?
Investigating Training Instability with Mixed-Precision Hardware