Fill in the Blank

The BERT-large model, which has 340 million parameters in total, is built from 24 Transformer layers with a hidden size of 1,024. This architecture uses ____ attention heads in each layer.

