Example of Scaling BERT Parameters to 3.9 Billion

The Megatron work by Shoeybi et al. (2019) is a notable example of scaling a BERT-style model to 3.9 billion parameters. A model of this size exceeds the memory and compute capacity of any single device, so its weights and computation had to be partitioned across hundreds of GPUs whose work was kept synchronized throughout training.
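The parallel scheme Megatron introduced is intra-layer (tensor) model parallelism: individual weight matrices are split across GPUs, each GPU computes a partial result, and collective communication recombines the pieces. The sketch below illustrates the idea for a Transformer MLP block in PyTorch; the class names, sizes, and structure are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.distributed as dist


class ColumnParallelLinear(nn.Module):
    """First linear of the MLP: each rank holds a slice of the output features."""

    def __init__(self, hidden_size: int, ffn_size: int, world_size: int):
        super().__init__()
        assert ffn_size % world_size == 0
        self.linear = nn.Linear(hidden_size, ffn_size // world_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each GPU computes its own slice of the activations; no communication yet.
        return torch.nn.functional.gelu(self.linear(x))


class RowParallelLinear(nn.Module):
    """Second linear of the MLP: each rank holds a slice of the input features."""

    def __init__(self, ffn_size: int, hidden_size: int, world_size: int):
        super().__init__()
        assert ffn_size % world_size == 0
        # Bias is kept out of the sharded matmul so it is not summed world_size times.
        self.linear = nn.Linear(ffn_size // world_size, hidden_size, bias=False)
        self.bias = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, x_shard: torch.Tensor) -> torch.Tensor:
        # Each GPU produces a partial result from its own weight shard ...
        partial = self.linear(x_shard)
        # ... and a synchronized all-reduce sums the partial results so that
        # every GPU ends up with the same full output.
        # (Assumes torch.distributed.init_process_group has already been called.)
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        return partial + self.bias
```

In the actual Megatron setup this intra-layer splitting was combined with ordinary data parallelism, which adds a further round of synchronized all-reduces over gradients; the sketch above only shows the forward-pass communication within one MLP block.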
