Short Answer

Prioritizing Architectural Modifications for Training Stability

A research team is scaling up a language model and is primarily concerned with preventing training instability, which they've encountered in past projects with very deep networks. They are considering two architectural changes: (A) replacing standard post-layer normalization with pre-layer normalization, or (B) replacing the standard dense feed-forward network with a Mixture-of-Experts (MoE) layer. Which of these two modifications would be the more direct and effective solution to their primary concern? Justify your choice by explaining how your selected modification addresses training instability and why the other option is less suited to this specific goal.

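For concreteness, here is a minimal sketch of the two normalization placements the question contrasts, written in PyTorch; the class names, dimensions, and the choice of `nn.MultiheadAttention` with a GELU feed-forward are illustrative assumptions rather than part of the question. The post-LN block normalizes after each residual addition (the original Transformer ordering), whereas the pre-LN block normalizes inside each residual branch, leaving the skip connection as an identity path through the depth of the network.

```python
import torch
import torch.nn as nn


class PostLNBlock(nn.Module):
    """Original Transformer ordering: LayerNorm is applied AFTER each residual addition."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual sum itself passes through LayerNorm, so the skip path
        # is rescaled at every layer of a deep stack.
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.norm2(x + self.ffn(x))
        return x


class PreLNBlock(nn.Module):
    """Pre-LN ordering: LayerNorm is applied BEFORE each sublayer; the skip stays an identity."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only the sublayer input is normalized; the skip connection is untouched,
        # so gradients reach lower layers through a chain of identity additions.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.norm2(x))
        return x
```

The design difference the sketch highlights is the residual path: in the pre-LN variant gradients flow to lower layers through unmodified identity additions, which is why very deep pre-LN models are typically reported to train more stably (and with less aggressive learning-rate warm-up) than their post-LN counterparts.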

Updated 2025-10-06


Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Evaluation in Bloom's Taxonomy
