1Cademy - Diagnosing Instability in Large-Scale Model Training

Learn Before

Numerical Computation Issues in Distributed Training

Essay

Diagnosing Instability in Large-Scale Model Training

A machine learning team is training a very large model on a cluster of hundreds of processing units. To improve efficiency, they perform calculations in a low-precision 16-bit number format. Each unit computes updates on a small batch of data, and the updates from all units are then summed together.

The team observes two problems:

1) The model's performance metrics begin to diverge slightly across different groups of processing units, even when using identical configurations and data.
2) The training process occasionally halts because key values become invalid (e.g., 'Not a Number').

Analyze this scenario and identify the two most likely numerical computation issues causing these problems. For each issue, explain how it leads to one of the observed problems.
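As a starting point for the analysis, the following minimal sketch shows two behaviours of 16-bit floating-point addition that are worth examining in this scenario: the result of a sum can depend on the order of the additions, and a sum can exceed the largest representable value. It assumes the 16-bit format is IEEE 754 half precision (the prompt does not name the format), and the helpers `to_fp16`, `fp16_add`, and `sum_fp16` are hypothetical names that emulate half-precision rounding with Python's standard library rather than any real training framework.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    if math.isnan(x) or math.isinf(x):
        return x
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack out-of-range values; emulate overflow to infinity.
        return math.copysign(math.inf, x)


def fp16_add(a: float, b: float) -> float:
    """Add two values and round the result back to half precision."""
    return to_fp16(a + b)


def sum_fp16(values):
    """Accumulate left to right, rounding to half precision after every addition."""
    acc = 0.0
    for v in values:
        acc = fp16_add(acc, to_fp16(v))
    return acc


# Same three updates, summed in two different orders.
order_a = sum_fp16([2048.0, 1.0, 1.0])  # the small terms are rounded away
order_b = sum_fp16([1.0, 1.0, 2048.0])  # the small terms combine first
print(order_a, order_b)  # 2048.0 2050.0

# Two large but representable updates whose sum is out of range.
overflowed = sum_fp16([60000.0, 60000.0])
print(overflowed)  # inf

# Arithmetic on an infinite value can then produce an invalid result.
print(fp16_add(overflowed, -overflowed))  # nan
```

In a distributed run, the order in which per-unit updates are combined may vary between groups of units, so the first effect is one plausible source of small discrepancies, while the second shows one route from large values to 'Not a Number'.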


Updated 2025-10-01
