Learn Before
Match each hyperparameter of the BERT-large model to its correct value.
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Recall in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A research team is deciding between two pre-trained language models for a complex text classification task. Model A has 12 transformer layers, a hidden size of 768, and 12 attention heads. Model B has 24 transformer layers, a hidden size of 1,024, and 16 attention heads. What is the most critical trade-off the team must evaluate when considering Model B over Model A?
Match each hyperparameter of the BERT-large model to its correct value.
The BERT-large model, which has a total of 340 million parameters, is built using 24 Transformer layers and a hidden size of 1,024. This architecture utilizes ____ attention heads in each layer.
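The hyperparameters above roughly determine the parameter count. As a minimal sketch (my own back-of-the-envelope estimate, not an official formula), each Transformer layer contributes about 12·H² weights and the embedding tables add the rest, which lands near the reported 340 million:

```python
# Rough parameter-count estimate for a BERT-style encoder (a sketch;
# it ignores biases, LayerNorm weights, and the pooler, so it
# slightly undercounts the official 340M figure for BERT-large).
def estimate_bert_params(layers, hidden, vocab=30522, max_pos=512, type_vocab=2):
    # Per layer: Q, K, V, and attention-output projections (4 * H^2)
    # plus a feed-forward block with intermediate size 4H
    # (H * 4H + 4H * H = 8 * H^2), i.e. ~12 * H^2 weights per layer.
    per_layer = 12 * hidden * hidden
    # Embedding tables: token, position, and segment (token-type).
    embeddings = (vocab + max_pos + type_vocab) * hidden
    return layers * per_layer + embeddings

# BERT-large: 24 layers, hidden size 1,024
total = estimate_bert_params(layers=24, hidden=1024)
print(f"~{total / 1e6:.0f}M parameters")  # close to the reported 340M
```

The same function applied to the base configuration (12 layers, hidden size 768) gives roughly 110M parameters, which illustrates the scale gap behind the Model A vs. Model B trade-off in the related question.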