Learn Before
BERT-base Hyperparameters
The BERT-base model is configured with a specific set of hyperparameters that determine its overall size and architectural capacity. These key settings are a hidden size (H) of 768, a model depth of 12 Transformer layers (L), and 12 attention heads (A). Together, this configuration produces a model containing a total of approximately 110 million parameters.
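The 110-million-parameter figure can be checked with back-of-the-envelope arithmetic. The Python sketch below estimates the count from the hyperparameters above; the vocabulary size (30,522), maximum sequence length (512), and feed-forward width (4H) follow the original BERT paper, and the script is an illustrative approximation rather than an exact accounting.

```python
def bert_param_count(L=12, H=768,
                     vocab_size=30522,   # WordPiece vocabulary (BERT paper)
                     max_position=512,   # maximum sequence length
                     type_vocab=2,       # segment (token-type) embeddings
                     ffn_mult=4):        # feed-forward size = 4 * H
    """Rough parameter count for a BERT-style Transformer encoder."""
    ffn = ffn_mult * H

    # Embedding tables (token, position, segment) plus their LayerNorm.
    embeddings = (vocab_size + max_position + type_vocab) * H + 2 * H

    # One encoder layer:
    # self-attention: Q, K, V, and output projections (weights + biases)
    attention = 4 * (H * H + H)
    # feed-forward: H -> 4H -> H (weights + biases)
    feed_forward = (H * ffn + ffn) + (ffn * H + H)
    # two LayerNorms per layer (gamma + beta each)
    layer_norms = 2 * 2 * H
    per_layer = attention + feed_forward + layer_norms

    # Pooler head (dense H -> H), used for classification tasks.
    pooler = H * H + H

    return embeddings + L * per_layer + pooler

print(f"{bert_param_count() / 1e6:.1f}M parameters")  # ~109.5M, i.e. ~110M
```

Note that the attention-head count A does not enter the calculation: the hidden size H is split evenly across the heads, so changing A alone leaves the parameter total unchanged.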

Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
BERT-base Hyperparameters
BERT-large Hyperparameters
Challenges of Large-Scale BERT Models
A team is developing a large, bidirectional, transformer-based language model. Their initial design has 12 processing layers, a hidden state dimension of 768, and 12 attention heads. To significantly increase the model's capacity, they are considering two potential modifications. Which single change would result in a greater increase in the model's total number of parameters?
Model Selection for a Resource-Constrained Application
You are presented with two common configurations for a bidirectional, transformer-based language model. Match each model scale to its corresponding set of architectural hyperparameters.
Learn After
A standard language model architecture with approximately 110 million parameters is built using a specific combination of layers, hidden size, and attention heads. Which of the following configurations correctly represents this model?
Hyperparameter Configuration for a Standard Language Model
A standard language model architecture with approximately 110 million parameters is defined by a specific set of hyperparameters. Match each hyperparameter with its correct value for this model.