Learn Before
  • Examples of Pre-trained Transformers by Architecture
Reference

BERT

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of HLT-NAACL. Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
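The cited paper's core pre-training objective is masked language modeling (MLM). A minimal, self-contained sketch of that corruption step follows; it uses the paper's 15% selection rate and 80/10/10 replacement split, but the whitespace tokenizer, toy vocabulary, and function name `mlm_mask` are illustrative simplifications, not BERT's actual WordPiece pipeline:

```python
import random

def mlm_mask(tokens, vocab, mask_prob=0.15, rng=None):
    """BERT-style MLM corruption: each token is selected with
    probability mask_prob; a selected token becomes [MASK] 80% of
    the time, a random vocabulary token 10%, and is left unchanged
    10%. Returns the corrupted sequence and per-position labels
    (None where the model is not asked to predict anything)."""
    rng = rng or random.Random(0)
    out, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must recover this token
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: keep the original token in place
    return out, labels

tokens = "my dog is hairy".split()
vocab = ["my", "dog", "is", "hairy", "cat", "runs"]
corrupted, labels = mlm_mask(tokens, vocab, rng=random.Random(42))
```

Leaving 10% of selected tokens unchanged matters: the encoder cannot assume every unmasked token is correct, which pushes it toward the deep bidirectional representations the paper's title refers to.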

Updated 2022-05-26

Contributor: Adam Nik

Affiliation: Carleton College

Tags

Data Science

Related
  • BERT
  • BART
  • T5
  • BERT (Bidirectional Encoder Representations from Transformers)
  • RoBERTa
  • GPT Series
  • LLaMA2
  • DeepSeek-V3
  • Falcon
  • Mistral
  • PaLM-450B
  • Gemma-7B
  • Gemma2
  • A software development team is tasked with building a feature that automatically generates a concise, one-paragraph summary of a long news article. The system must first comprehend the full context of the source article and then generate a new, coherent summary. Based on the typical strengths of different foundational model designs, which of the following models would be the most suitable choice for this task?
  • Match each pre-trained model with the description that best fits its architectural design and primary use case.
  • Evaluating Model Architecture Selection for a Classification Task

© 1Cademy 2026