Multiple Choice

A deep sequence model is constructed by stacking multiple layers. Each layer consists of two sub-layers (e.g., a self-attention mechanism and a feed-forward network). A 'post-norm' architecture is used for each sub-layer, which involves applying the sub-layer's main function, adding a residual connection from the input, and then performing layer normalization. If x represents the input to a sub-layer and F(x) represents the output of that sub-layer's main function, which of the following expressions correctly computes the final output of that sub-layer?
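
The order of operations described above (main function, then residual addition, then layer normalization) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not any library's implementation: the names `layer_norm` and `post_norm_sublayer` are hypothetical, the input is a single feature vector given as a list of floats, and the learned gain and bias that layer normalization usually carries are omitted for brevity.

```python
import math


def layer_norm(v, eps=1e-5):
    """Normalize a feature vector to zero mean and (roughly) unit variance.

    Assumption: no learned gain/bias parameters, to keep the sketch short.
    """
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def post_norm_sublayer(x, F):
    """Post-norm sub-layer: apply F, add the residual, then normalize."""
    fx = F(x)                                   # sub-layer's main function F(x)
    residual = [a + b for a, b in zip(x, fx)]   # residual connection: x + F(x)
    return layer_norm(residual)                 # layer normalization comes last


# Usage with a stand-in for F (a real model would use attention or an FFN here).
x = [1.0, 2.0, 3.0]
out = post_norm_sublayer(x, lambda v: [2.0 * a for a in v])
```

In this sketch normalization is the outermost step, so the sub-layer output is always a normalized vector regardless of how large `x + F(x)` becomes.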


Updated 2025-09-28

Tags

Ch.1 Pre-training - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science