
Example of Scaling BERT Parameters to 1.5 Billion

The 1.5-billion-parameter DeBERTa model of He et al. (2021) is a prime example of scaling up a BERT-like architecture. Whereas BERT-large has roughly 340 million parameters spread over 24 Transformer layers with a hidden size of 1,024, the scaled-up model reaches 1.5 billion parameters primarily by increasing both the hidden size and the number of layers.
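To make the effect of these two knobs concrete, the sketch below estimates encoder parameter counts from layer count, hidden size, and vocabulary size. The BERT-large figures (24 layers, hidden size 1,024, ~30K WordPiece vocabulary) are standard; the 48-layer, 1,536-hidden, ~128K-vocabulary configuration is an assumption based on the reported DeBERTa 1.5B setup, and the formula deliberately ignores biases, layer norms, and DeBERTa's disentangled-attention extras, so the numbers are approximate.

# Rough back-of-the-envelope estimate of BERT-style encoder parameter counts.
# Assumed configurations: BERT-large (24 layers, d=1024, ~30K vocab) and a
# hypothetical scaled model (48 layers, d=1536, ~128K vocab) approximating
# the DeBERTa 1.5B setup. Biases, layer norms, and DeBERTa-specific
# disentangled-attention projections are ignored.

def transformer_encoder_params(num_layers: int, hidden: int, vocab: int) -> int:
    """Approximate parameter count of a BERT-style Transformer encoder."""
    attention = 4 * hidden * hidden      # Q, K, V, and output projections
    ffn = 2 * hidden * (4 * hidden)      # two linear maps with a 4x inner size
    per_layer = attention + ffn          # roughly 12 * hidden^2 per layer
    embeddings = vocab * hidden          # token embedding table
    return num_layers * per_layer + embeddings

bert_large = transformer_encoder_params(num_layers=24, hidden=1024, vocab=30_522)
scaled_model = transformer_encoder_params(num_layers=48, hidden=1536, vocab=128_000)

print(f"BERT-large estimate:   {bert_large / 1e6:.0f}M parameters")   # ~333M
print(f"Scaled-model estimate: {scaled_model / 1e9:.2f}B parameters") # ~1.56B

Doubling the depth and increasing the hidden size by half grows the per-layer cost quadratically in the hidden size (~12 * hidden^2), which is why these two changes alone take the estimate from a few hundred million parameters into the 1.5-billion range.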
