Short Answer

Architectural Analysis for Training Stability

A research team is building a very deep (e.g., 50+ layer) sequence-processing model. They find that training is highly unstable, with gradients frequently exploding. Their current sub-layer design computes its output by first adding the original input to the result of the sub-layer's main function (e.g., attention), and then applying a normalization step to this combined sum. Analyze why this specific placement of the normalization step is likely contributing to the training instability in such a deep network.
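
For concreteness, the sub-layer design described in the question could be sketched roughly as follows. This is a minimal sketch assuming PyTorch; the class name `PostNormSublayer`, the use of `nn.MultiheadAttention`, and the dimensions are illustrative assumptions, not part of the question.

```python
import torch
import torch.nn as nn

class PostNormSublayer(nn.Module):
    """Illustrative sketch of the sub-layer described above:
    the residual input is first added to the output of the sub-layer's
    main function (here, self-attention), and normalization is then
    applied to that combined sum."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        # The sub-layer's main function, e.g. self-attention.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalization is placed AFTER the residual addition,
        # i.e. output = Norm(x + F(x)), as the question describes.
        attn_out, _ = self.attn(x, x, x)
        return self.norm(x + attn_out)

# Hypothetical usage: a 50+ layer model would stack many such blocks.
block = PostNormSublayer(d_model=512)
x = torch.randn(2, 10, 512)   # (batch, sequence, features)
y = block(x)                  # same shape as x
```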

