Example

BERT-large Hyperparameters

The BERT-large model is a substantially deeper and wider version of the standard BERT architecture, defined by an expanded set of hyperparameters: a hidden size of $d = 1024$, a depth of $L = 24$ Transformer layers, and $n_{\mathrm{head}} = 16$ attention heads. This scaled-up configuration yields a network with roughly 340 million parameters in total.
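To see where the 340 million figure comes from, here is a minimal back-of-the-envelope parameter count in Python. The vocabulary size (30522), maximum sequence length (512), and feed-forward inner size ($4d$) are not stated above; they are the values used in the original BERT release and are assumed here.

```python
# Approximate BERT-large parameter count from its hyperparameters.
d = 1024          # hidden size
L = 24            # number of Transformer layers
n_head = 16       # attention heads; sets head size d / n_head = 64,
                  # but does not change the parameter count
d_ffn = 4 * d     # feed-forward inner size (4d, assumed per original BERT)
vocab = 30522     # WordPiece vocabulary size (assumed)
max_pos = 512     # maximum position embeddings (assumed)

# Embedding block: token + position + segment embeddings, plus LayerNorm.
embeddings = (vocab + max_pos + 2) * d + 2 * d

# One Transformer layer:
#   - self-attention: Q, K, V, and output projections (weights + biases)
#   - feed-forward: two linear maps d -> 4d -> d
#   - two LayerNorms (scale + shift each)
attention = 4 * (d * d + d)
ffn = (d * d_ffn + d_ffn) + (d_ffn * d + d)
layer_norms = 2 * 2 * d
per_layer = attention + ffn + layer_norms

total = embeddings + L * per_layer
print(f"per layer: {per_layer / 1e6:.1f}M, total: {total / 1e6:.0f}M")
# -> about 334M; adding the pooler head brings it close to
#    the commonly cited ~340M figure
```

Note that the per-layer cost (~12.6M parameters) is dominated by the feed-forward block, and the 24 layers together account for roughly 300M of the total, with the embedding table contributing most of the rest.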
