Essay

Diagnosing Instability in Large-Scale Model Training

A machine learning team is training a very large model on a cluster of hundreds of processing units. To improve efficiency, they perform calculations using a low-precision 16-bit number format. The training process involves calculating updates on small data batches on each unit and then summing these updates together. The team observes two problems: 1) The model's performance metrics begin to diverge slightly across different groups of processing units, even when using identical configurations and data. 2) The training process occasionally halts because key values become invalid (e.g., 'Not a Number'). Analyze this scenario and identify the two most likely numerical computation issues causing these problems. For each issue, explain how it leads to one of the observed problems.

0

1

Updated 2025-10-01

Contributors are:

Who are from:

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science