Learn Before
BERT-large Hyperparameters
The BERT-large model is a substantially deeper and wider version of the standard architecture, defined by an expanded set of hyperparameters: a hidden size of 1,024, a depth of 24 Transformer layers, and 16 attention heads per layer. This scaled-up configuration yields a network with roughly 340 million parameters in total.
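
As a worked check on these numbers, the short Python sketch below estimates the parameter count from the hyperparameters alone. The vocabulary size (30,522 WordPiece tokens), the 512-position limit, the 2 segment types, and the 4×H feed-forward width are assumptions taken from the published BERT configuration, not from the card above.

    def bert_params(num_layers: int, hidden: int) -> int:
        """Approximate parameter count for a BERT-style encoder."""
        # Assumed constants from the published BERT setup, not from this card.
        vocab, max_pos, segments, ffn = 30_522, 512, 2, 4 * hidden

        # Embeddings: token + position + segment tables, plus one LayerNorm (gain and bias).
        embeddings = (vocab + max_pos + segments) * hidden + 2 * hidden

        # Per layer: Q/K/V/output projections (weights and biases), the
        # feed-forward block (two linear maps), and two LayerNorms.
        attention = 4 * (hidden * hidden + hidden)
        feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
        layer_norms = 2 * 2 * hidden
        encoder = num_layers * (attention + feed_forward + layer_norms)

        # Pooler head over the [CLS] token.
        pooler = hidden * hidden + hidden

        return embeddings + encoder + pooler

    print(f"BERT-base:  {bert_params(12, 768) / 1e6:.0f}M")    # ~109M, commonly reported as 110M
    print(f"BERT-large: {bert_params(24, 1024) / 1e6:.0f}M")   # ~335M, commonly rounded to 340M

Note that the 16 attention heads add no parameters of their own: each head has dimension 1,024 / 16 = 64, so the projection matrices keep the same total shape regardless of the head count.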

Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
BERT-base Hyperparameters
Challenges of Large-Scale BERT Models
A team is developing a large, bidirectional, transformer-based language model. Their initial design has 12 processing layers, a hidden state dimension of 768, and 12 attention heads. To significantly increase the model's capacity, they are considering two potential modifications. Which single change would result in a greater increase in the model's total number of parameters? (A rough parameter-scaling sketch follows this list.)
Model Selection for a Resource-Constrained Application
You are presented with two common configurations for a bidirectional, transformer-based language model. Match each model scale to its corresponding set of architectural hyperparameters.
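
The capacity question above does not reproduce the two candidate modifications, so the sketch below is illustrative only: it compares two hypothetical scaling moves, doubling the depth versus widening the hidden size from 768 to 1,024, using the rough rule that each encoder layer holds about 12·H² weights (4·H² in the attention projections, 8·H² in the feed-forward block).

    # Illustrative assumption: these two modifications are NOT necessarily
    # the ones in the original question, which does not list them.
    def encoder_weights(num_layers: int, hidden: int) -> int:
        # Dominant encoder weights per layer: ~4*H^2 (attention) + ~8*H^2 (FFN).
        return 12 * num_layers * hidden * hidden

    baseline = encoder_weights(12, 768)
    deeper = encoder_weights(24, 768)    # hypothetical: double the layer count
    wider = encoder_weights(12, 1024)    # hypothetical: widen 768 -> 1,024
    print(f"baseline ~{baseline / 1e6:.0f}M, deeper ~{deeper / 1e6:.0f}M, wider ~{wider / 1e6:.0f}M")

Depth grows the encoder linearly while width grows it quadratically, so the answer depends on the specific numbers involved: here doubling the depth adds about 85M encoder weights, while this particular widening adds about 66M.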
Learn After
A research team is deciding between two pre-trained language models for a complex text classification task. Model A has 12 transformer layers, a hidden size of 768, and 12 attention heads. Model B has 24 transformer layers, a hidden size of 1,024, and 16 attention heads. What is the most critical trade-off the team must evaluate when considering Model B over Model A?
Match each hyperparameter of the BERT-large model to its correct value.
The BERT-large model, which has a total of 340 million parameters, is built using 24 Transformer layers and a hidden size of 1,024. This architecture utilizes ____ attention heads in each layer.