Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

https://paperswithcode.com/method/bert

Ohio State University, Columbus

University of Michigan - Ann Arbor

BERT (Bidirectional Encoder Representations from Transformers) is a prominent pre-trained sequence encoding model in NLP. As a foundational model, it is trained using a self-supervised approach that combines two tasks: masked language modeling (MLM) and next sentence prediction (NSP). In MLM, the model predicts randomly masked words from their context, enabling it to learn deep bidirectional language representations. This dual-task training makes BERT a versatile foundation model adaptable to a wide array of NLP applications.

BERT (Bidirectional Encoder Representations from Transformers)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

 $\textbf{B}$idirectional $\textbf{E}$ncoder $\textbf{R}$epresentations from $\textbf{T}$ransformers
- A new language representation model that is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers
- Beneficial because it is a generalist approach that allows for customization to specific tasks (question answering, language inference, etc) by adding just one additional layer, while other approaches might build entirely different models for different language-processing tasks

What is BERT?

The fundamental structure of a BERT model is a multi-layer Transformer network. This deep network is created by stacking numerous Transformer layers. The final output from the network's last layer is a sequence of real-valued vectors, where each vector corresponds to a position in the input sequence.

BERT's Core Architecture

Vocabulary size is a significant design consideration in BERT models. Each input token is mapped to an entry in a predefined vocabulary. A key trade-off exists: while a larger vocabulary can represent a wider variety of word forms, it also leads to greater storage requirements for the model.

Vocabulary Size Trade-off in BERT

In Transformer-based models, the embedding size, denoted as $d_e$, specifies the dimension of the real-valued vectors that represent each token. The final input vector for a token, as well as its constituent parts—the token embedding, positional embedding, and segment embedding—all share this same dimensionality.

Embedding Size in Transformer Models

The size of a BERT model is directly influenced by the configuration of its various hyperparameters. Adjusting these settings, such as the number of layers or attention heads, results in different model versions with varying sizes. For example, two widely-used BERT models exist, each with a distinct size determined by its specific hyperparameter settings.

BERT Model Sizes and Hyperparameters

A key research direction following the introduction of BERT involves scaling up the model. This strategy focuses on enhancing performance by increasing the volume of training data and constructing larger model architectures.

Strategies for Improving BERT: Model Scaling

Since the original BERT model was developed primarily for English, two main strategies have emerged to extend its capabilities to other languages. The first approach involves creating separate, dedicated models for each individual language. The second, more common approach, is to train a single multilingual model using a combined dataset from all targeted languages.

Approaches to Extending BERT for Multilingual Support

BERT's utility extends beyond language understanding to any task requiring text encoding. A significant application is in text generation, where tasks like machine translation, summarization, question answering, and dialogue generation are framed as sequence-to-sequence problems. In this setup, a BERT model can function as the encoder. The implementation involves initializing the encoder's weights with a pre-trained BERT model and then fine-tuning the entire encoder-decoder system on task-specific text pairs.

Learn Before

Related