BERT's Core Architecture
In a BERT model, input tokens are first mapped to embeddings, computed as the sum of their corresponding token, positional, and segment embeddings. This combined embedding sequence is then processed by the core architecture: a deep Transformer network formed by stacking many Transformer layers. Each layer in the stack consists of a self-attention sub-layer and a feed-forward network (FFN) sub-layer, both using a post-norm architecture. In this structure, the output of each sub-layer is computed as output = LNorm(F(x) + x), where x is the sub-layer's input, F is the sub-layer's main function (self-attention or the FFN), and LNorm is layer normalization. The final output of the network's last Transformer layer is a sequence of real-valued vectors, one vector for each position in the input sequence.
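The pipeline above can be sketched in a few lines of NumPy. This is a minimal toy illustration, not BERT itself: the dimensions, the random embedding tables, and the stand-in FFN are all assumptions chosen for the sketch; only the structure (summed embeddings, then a post-norm sub-layer computing LNorm(F(x) + x)) mirrors the description.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                         # toy hidden size (an assumption, not BERT's 768)
vocab, max_len, n_seg = 20, 10, 2

# Randomly initialized embedding tables, for illustration only.
tok_table = rng.normal(size=(vocab, d))
pos_table = rng.normal(size=(max_len, d))
seg_table = rng.normal(size=(n_seg, d))

def embed(token_ids, segment_ids):
    # BERT-style input embedding: sum of token, positional, and segment embeddings.
    positions = np.arange(len(token_ids))
    return tok_table[token_ids] + pos_table[positions] + seg_table[segment_ids]

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_sublayer(x, f):
    # Post-norm: apply the main function F, add the residual x, then normalize.
    return layer_norm(f(x) + x)

# Toy run: 5 tokens with segment ids, and a stand-in FFN as the function F.
x = embed(np.array([1, 4, 2, 9, 3]), np.array([0, 0, 0, 1, 1]))
ffn = lambda h: np.maximum(h @ rng.normal(size=(d, d)), 0.0)
y = post_norm_sublayer(x, ffn)
print(y.shape)  # one d-dimensional vector per input position
```

In a full model this sub-layer pattern is applied twice per layer (self-attention, then FFN) and the layers are stacked, with the last layer's output giving the per-position vectors described above.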
References
Reference of Foundations of Large Language Models Course
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
What is BERT?
BERT's Core Architecture
Embedding Size in Transformer Models
BERT Model Sizes and Hyperparameters
Strategies for Improving BERT: Model Scaling
Approaches to Extending BERT for Multilingual Support
Using BERT as an Encoder in Sequence-to-Sequence Models
Considerations in BERT Model Development
Analysis of Bidirectional Context in Language Models
A language model is pre-trained using a method where it is given a sentence with a randomly hidden word, for example: 'The quick brown [HIDDEN] jumps over the lazy dog.' The model's goal is to predict the hidden word by examining all the other visible words in the sentence. What is the primary advantage of this specific training approach for understanding language?
Evaluating Pre-training Task Relevance
Designing a Mobile-Deployable BERT Encoder Under Tight Memory and Latency Constraints
Choosing a BERT Compression Strategy for an On-Prem Document Triage System
Selecting a BERT Variant for a Regulated, On-Device Email Classification Feature
Right-Sizing a BERT Encoder for a Multilingual Support-Ticket Router Without Breaking the Memory Budget
Selecting an Efficient BERT Variant for a Domain-Specific Contract Clause Classifier
Compressing a BERT-Based Search Re-Ranker for Edge Deployment Without Losing Domain Coverage
Your team is adapting a pre-trained BERT encoder (...
Your team is reviewing a design doc for an efficie...
You’re leading an internal rollout of a BERT-based...
Your team is compressing an internal BERT-based en...
Vocabulary Size in Transformers
BERT Output Adapter
Output Probability Calculation in Transformer Language Models
Trade-offs of Model Depth
An AI team is developing solutions for two distinct tasks: Task A, which involves classifying short customer reviews as positive or negative, and Task B, which requires generating concise summaries of long, complex legal documents. They have two available models: Model X with 6 stacked processing layers and Model Y with 24 stacked processing layers. Based on the relationship between model depth and capability, which of the following strategies is most appropriate?
Analyzing the Impact of Increasing Model Layers
Learn After
Training Objective of the Standard BERT Model
A deep sequence model is constructed by stacking multiple layers. Each layer consists of two sub-layers (e.g., a self-attention mechanism and a feed-forward network). A 'post-norm' architecture is used for each sub-layer, which involves applying the sub-layer's main function, adding a residual connection from the input, and then performing layer normalization. If x represents the input to a sub-layer and F(x) represents the output of that sub-layer's main function, which of the following expressions correctly computes the final output of that sub-layer?
A deep sequence model is built by stacking multiple layers. Each layer contains sub-layers (like self-attention or a feed-forward network) that use a 'post-norm' architecture. Arrange the following operations in the correct order as they would occur to transform an input vector within a single sub-layer.
Architectural Component Analysis
Input Embedding Formula in BERT-like Models