Concept

BERT's Core Architecture

In a BERT model, input tokens are initially represented as embeddings, calculated as the sum of their corresponding token, positional, and segment embeddings. This combined embedding sequence is then processed by the core architecture: a deep, multi-layer Transformer network formed by stacking numerous Transformer layers. Each layer in this stack is composed of a self-attention sub-layer and a feed-forward network (FFN) sub-layer, both of which use a post-norm architecture. In this structure, the output is calculated as \mathrm{output} = \mathrm{LNorm}(F(\mathrm{input}) + \mathrm{input}), where F(\cdot) represents the sub-layer's main function (self-attention or FFN) and \mathrm{LNorm}(\cdot) is layer normalization. The final output produced by the network's last Transformer layer is a sequence of real-valued vectors, with one vector corresponding to each position in the input sequence.
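The embedding sum and the post-norm residual formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not BERT's actual implementation: the dimensions are toy-sized, the learned scale and bias of layer normalization are omitted, and a simple ReLU projection stands in for the real FFN sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-12):
    # Normalize each position's vector to zero mean and unit variance
    # (learned scale/bias parameters omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_sublayer(f, x):
    # Post-norm residual connection: output = LNorm(F(input) + input)
    return layer_norm(f(x) + x)

# Toy dimensions for illustration, not BERT's real sizes.
seq_len, d_model = 4, 8
rng = np.random.default_rng(0)

# Input representation: sum of token, positional, and segment embeddings.
token_emb = rng.normal(size=(seq_len, d_model))
pos_emb = rng.normal(size=(seq_len, d_model))
seg_emb = rng.normal(size=(seq_len, d_model))
x = token_emb + pos_emb + seg_emb

# A stand-in F(.) for the FFN sub-layer (hypothetical single projection;
# the real FFN uses two linear layers with a GELU activation).
W = rng.normal(size=(d_model, d_model))
ffn = lambda h: np.maximum(h @ W, 0.0)

y = post_norm_sublayer(ffn, x)
print(y.shape)  # one d_model-sized vector per input position
```

Note that in the post-norm arrangement the residual sum happens before normalization, so every position's output vector ends up normalized; stacking such layers repeats this pattern once per sub-layer.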

Updated 2026-04-17

Tags

Ch.1 Pre-training - Foundations of Large Language Models