Learn Before
BERT-large Hyperparameters
The BERT-large model is a substantially deeper and wider version of the standard architecture, defined by an expanded set of hyperparameters: a hidden size of 1,024, a depth of 24 Transformer layers, and 16 attention heads per layer. This scaled-up configuration yields a network with roughly 340 million parameters in total.
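
As a worked check on these numbers, the short Python sketch below estimates the parameter count from the hyperparameters alone. The vocabulary size (30,522 WordPiece tokens), the 512-position limit, the 2 segment types, and the 4×H feed-forward width are assumptions taken from the published BERT configuration, not from the card above.

    def bert_params(num_layers: int, hidden: int) -> int:
        """Approximate parameter count for a BERT-style encoder."""
        # Assumed constants from the published BERT setup, not from this card.
        vocab, max_pos, segments, ffn = 30_522, 512, 2, 4 * hidden

        # Embeddings: token + position + segment tables, plus one LayerNorm (gain and bias).
        embeddings = (vocab + max_pos + segments) * hidden + 2 * hidden

        # Per layer: Q/K/V/output projections (weights and biases), the
        # feed-forward block (two linear maps), and two LayerNorms.
        attention = 4 * (hidden * hidden + hidden)
        feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
        layer_norms = 2 * 2 * hidden
        encoder = num_layers * (attention + feed_forward + layer_norms)

        # Pooler head over the [CLS] token.
        pooler = hidden * hidden + hidden

        return embeddings + encoder + pooler

    print(f"BERT-base:  {bert_params(12, 768) / 1e6:.0f}M")    # ~109M, commonly reported as 110M
    print(f"BERT-large: {bert_params(24, 1024) / 1e6:.0f}M")   # ~335M, commonly rounded to 340M

Note that the 16 attention heads add no parameters of their own: each head has dimension 1,024 / 16 = 64, so the projection matrices keep the same total shape regardless of the head count.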

Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
BERT-base Hyperparameters
Challenges of Large-Scale BERT Models
A team is developing a large, bidirectional, transformer-based language model. Their initial design has 12 processing layers, a hidden state dimension of 768, and 12 attention heads. To significantly increase the model's capacity, they are considering two potential modifications. Which single change would result in a greater increase in the model's total number of parameters? (A rough parameter-scaling sketch follows this list.)
Model Selection for a Resource-Constrained Application
You are presented with two common configurations for a bidirectional, transformer-based language model. Match each model scale to its corresponding set of architectural hyperparameters.
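
The capacity question above does not reproduce the two candidate modifications, so the sketch below is illustrative only: it compares two hypothetical scaling moves, doubling the depth versus widening the hidden size from 768 to 1,024, using the rough rule that each encoder layer holds about 12·H² weights (4·H² in the attention projections, 8·H² in the feed-forward block).

    # Illustrative assumption: these two modifications are NOT necessarily
    # the ones in the original question, which does not list them.
    def encoder_weights(num_layers: int, hidden: int) -> int:
        # Dominant encoder weights per layer: ~4*H^2 (attention) + ~8*H^2 (FFN).
        return 12 * num_layers * hidden * hidden

    baseline = encoder_weights(12, 768)
    deeper = encoder_weights(24, 768)    # hypothetical: double the layer count
    wider = encoder_weights(12, 1024)    # hypothetical: widen 768 -> 1,024
    print(f"baseline ~{baseline / 1e6:.0f}M, deeper ~{deeper / 1e6:.0f}M, wider ~{wider / 1e6:.0f}M")

Depth grows the encoder linearly while width grows it quadratically, so the answer depends on the specific numbers involved: here doubling the depth adds about 85M encoder weights, while this particular widening adds about 66M.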
Learn After
A research team is deciding between two pre-trained language models for a complex text classification task. Model A has 12 transformer layers, a hidden size of 768, and 12 attention heads. Model B has 24 transformer layers, a hidden size of 1,024, and 16 attention heads. What is the most critical trade-off the team must evaluate when considering Model B over Model A?
Match each hyperparameter of the BERT-large model to its correct value.
The BERT-large model, which has a total of 340 million parameters, is built using 24 Transformer layers and a hidden size of 1,024. This architecture utilizes ____ attention heads in each layer.