Rationale for Architectural Changes in Large-Scale Models
A research lab attempts to build a state-of-the-art language model by simply scaling up a well-established, standard neural network design, increasing its number of layers and parameters. During training, they observe that optimization is highly erratic and the run frequently collapses, despite a powerful distributed computing setup. Analyze the underlying reasons why this direct scaling approach often fails, and explain the fundamental purpose of introducing deliberate architectural changes to achieve stable training of very large models.
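To make the kind of architectural change the prompt alludes to concrete, below is a minimal PyTorch sketch of one representative modification: a pre-layer-normalization (Pre-LN) residual block, in which each sublayer's input is normalized before the sublayer rather than after. The card itself does not name a specific technique; the class name PreLNBlock, the layer dimensions, and the 48-layer depth check are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Transformer block with pre-layer normalization (Pre-LN).

    Normalizing *before* each sublayer keeps the residual path an
    identity mapping, so activation and gradient magnitudes stay
    bounded as depth grows -- one common architectural change for
    stable large-model training.
    """
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual branch 1: normalize first, then self-attend.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Residual branch 2: normalize first, then feed-forward.
        x = x + self.ff(self.norm2(x))
        return x

# Quick check: even a 48-layer stack yields finite activations.
blocks = nn.Sequential(*[PreLNBlock(64, 4, 256) for _ in range(48)])
out = blocks(torch.randn(2, 16, 64))
print(out.shape, torch.isfinite(out).all().item())
```

An equivalent stack with post-sublayer normalization (the original Post-LN arrangement) tends to require careful learning-rate warmup at this depth and is more prone to the erratic, collapsing training behavior described above, which is why Pre-LN-style reorderings are a standard deliberate change in very large models.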
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Evaluating a Training Strategy for a New Large Model
Layer Normalization in Transformers
A research team is training a very deep language model based on a standard network design. They observe that as they increase the model's depth, the training process frequently fails with loss values suddenly becoming invalid (NaN). This forces them to restart training repeatedly. Which of the following architectural changes is most specifically designed to mitigate this kind of deep-network training instability?
Rationale for Architectural Changes in Large-Scale Models
Connecting Model Scale and Architectural Design
Omission of Bias Terms in LLM Affine Transformations