Learn Before
A deep sequence model is constructed by stacking multiple layers. Each layer consists of two sub-layers (e.g., a self-attention mechanism and a feed-forward network). A 'post-norm' architecture is used for each sub-layer, which involves applying the sub-layer's main function, adding a residual connection from the input, and then performing layer normalization. If x represents the input to a sub-layer and F(x) represents the output of that sub-layer's main function, which of the following expressions correctly computes the final output of that sub-layer?
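The post-norm ordering described in the question can be encoded directly: apply the sub-layer function, add the residual connection, then layer-normalize. Below is a minimal NumPy sketch; the function names and the simplified layer normalization (no learnable gain or bias) are illustrative, not a specific library's API. A pre-norm variant is included for contrast.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize across the feature (last) dimension, as in standard
    # LayerNorm; learnable gain/bias parameters are omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def post_norm_sublayer(x, F):
    # Post-norm: sub-layer function, residual add, then LayerNorm.
    return layer_norm(x + F(x))

def pre_norm_sublayer(x, F):
    # Contrast: pre-norm normalizes the input first; the residual
    # connection is applied after the sub-layer function.
    return x + F(layer_norm(x))
```

For example, with `F` standing in for self-attention or a feed-forward network, `post_norm_sublayer(x, F)` computes LayerNorm(x + F(x)), matching the order of operations in the question.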
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Training Objective of the Standard BERT Model
A deep sequence model is built by stacking multiple layers. Each layer contains sub-layers (like self-attention or a feed-forward network) that use a 'post-norm' architecture. Arrange the following operations in the correct order as they would occur to transform an input vector within a single sub-layer.
Architectural Component Analysis
Input Embedding Formula in BERT-like Models