1Cademy - Model Depth in Transformers

Approach 1: Processes the input text sequentially, token by token, updating an internal state that is passed from one step to the next.
Approach 2: Processes all input tokens simultaneously, using a mechanism that directly relates every token to every other token in the input to determine context.

Learn Before

Transformer
Considerations in BERT Model Development
Key Hyperparameters of a Transformer Encoder

Definition

Model Depth in Transformers

The expressive power of a Transformer network can be increased by making the model deeper. The depth, denoted by $L$, is the total number of stacked processing layers. In standard BERT architectures, $L$ is typically set to 12 (BERT-base) or 24 (BERT-large). Using even deeper networks is a viable strategy for further performance gains.
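To make the role of $L$ concrete, the sketch below counts the learnable parameters in a stack of encoder layers as a function of depth. It is an illustration under stated assumptions, not a reference implementation. The function names are hypothetical. The hidden size and feed-forward size (768 and 3072 for the 12-layer model, 1024 and 4096 for the 24-layer model) are the commonly published BERT values and do not come from the text above. The count covers only the stacked layers, not the embeddings or any output head.

```python
def encoder_layer_params(d_model: int, d_ff: int) -> int:
    """Parameters in one standard Transformer encoder layer (weights and biases)."""
    # Self-attention: query, key, value, and output projections, each d_model x d_model plus bias.
    attention = 4 * (d_model * d_model + d_model)
    # Position-wise feed-forward network: d_model -> d_ff -> d_model, with biases.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms, each with a scale and a shift vector of size d_model.
    layer_norms = 2 * (2 * d_model)
    return attention + feed_forward + layer_norms


def encoder_stack_params(depth: int, d_model: int, d_ff: int) -> int:
    """Parameters in a stack of `depth` identical encoder layers (depth is L)."""
    return depth * encoder_layer_params(d_model, d_ff)


# Assumed configurations (not stated in the text): (L, d_model, d_ff).
configs = {
    "L=12": (12, 768, 3072),
    "L=24": (24, 1024, 4096),
}

for name, (depth, d_model, d_ff) in configs.items():
    total = encoder_stack_params(depth, d_model, d_ff)
    print(f"{name}: {total:,} parameters in the encoder stack")
```

With the width held fixed, the count grows linearly in $L$, so doubling the depth doubles the parameters (and roughly the compute) of the stack. That cost is the trade-off against the added expressive power.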