Learn Before
Prioritizing Architectural Modifications for Training Stability
A research team is scaling up a language model and is primarily concerned with preventing training instability, which they've encountered in past projects with very deep networks. They are considering two architectural changes: (A) replacing standard post-layer normalization with pre-layer normalization, or (B) replacing the standard dense feed-forward network with a Mixture-of-Experts (MoE) layer. Which of these two modifications would be a more direct and effective solution for their primary concern? Justify your choice by explaining how your selected modification addresses training instability and why the other option is less suited for this specific goal.
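For context, here is a minimal sketch of the two normalization placements the question contrasts, written in PyTorch style. The class names PostLNBlock and PreLNBlock, the GELU feed-forward network, and the dimensions are illustrative assumptions for this sketch, not any particular model's implementation.

```python
import torch
import torch.nn as nn


class PostLNBlock(nn.Module):
    """Post-LN: LayerNorm is applied after the residual addition (original Transformer)."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual sum itself is normalized, so gradients must pass through
        # every LayerNorm on the way back down a very deep stack.
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ffn(x))
        return x


class PreLNBlock(nn.Module):
    """Pre-LN: LayerNorm is applied to each sublayer's input; the residual path is untouched."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The identity (residual) path is never normalized, giving gradients a
        # clean route through arbitrarily many layers, which is the property
        # that makes pre-LN markedly more stable in very deep models.
        normed = self.norm1(x)
        attn_out, _ = self.attn(normed, normed, normed, need_weights=False)
        x = x + attn_out
        x = x + self.ffn(self.norm2(x))
        return x
```

The only difference between the two blocks is where LayerNorm sits relative to the residual addition; in the pre-LN variant the residual path is an identity, which is the property most directly tied to stable gradient propagation, whereas an MoE layer changes capacity and routing rather than gradient flow.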
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Diagnosing Instability in Large-Scale Model Training
A team is training an exceptionally deep transformer-based language model and observes that the training process is highly unstable, with loss values fluctuating wildly and sometimes resulting in non-numeric values (NaNs). This suggests that the gradients are either exploding or vanishing as they propagate through the numerous layers. Which of the following architectural modifications is most specifically designed to address this type of instability in very deep networks?
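As a rough illustration of how such instability typically surfaces in training logs, here is a minimal, hypothetical monitoring helper for a standard PyTorch training loop; the function name check_step_health and the gradient-norm threshold are assumptions made for this sketch.

```python
import math
import torch


def check_step_health(model: torch.nn.Module, loss: torch.Tensor,
                      max_grad_norm: float = 1e3) -> dict:
    """Return per-step diagnostics; call after loss.backward(), before optimizer.step()."""
    loss_value = loss.item()
    diagnostics = {
        "loss": loss_value,
        # Catches both NaN and infinite losses, the symptom described above.
        "loss_nonfinite": not math.isfinite(loss_value),
    }
    # Global gradient norm over all parameters; a sudden spike here typically
    # precedes the NaN losses by a few steps when gradients are exploding.
    total_sq = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total_sq += float(p.grad.detach().norm()) ** 2
    diagnostics["grad_norm"] = total_sq ** 0.5
    diagnostics["grad_exploding"] = diagnostics["grad_norm"] > max_grad_norm
    return diagnostics
```

In practice, these diagnostics would be logged at every step and paired with mitigations such as gradient clipping, a lower learning rate, or the pre-layer-normalization change discussed above.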