1Cademy - An engineer is training a very deep sequence-processing model and observes that the gradients are becoming unstable, causing the training to fail. The current architecture of each sub-layer in the model computes its output using the formula: `output = Normalize(input + Function(input))`. Which of the following modifications to the sub-layers computational flow is most likely to resolve the instability issue by ensuring a cleaner information flow through the residual connections?

Learn Before

Pre-Norm Architecture in Transformers

Multiple Choice

An engineer is training a very deep sequence-processing model and observes that the gradients are becoming unstable, causing the training to fail. The current architecture of each sub-layer in the model computes its output using the formula: output = Normalize(input + Function(input)). Which of the following modifications to the sub-layer's computational flow is most likely to resolve the instability issue by ensuring a cleaner information flow through the residual connections?

Updated 2025-10-02

Contributors are:

Who are from:

Learn Before

Related