Architecture of a BERT-based Encoder-Decoder Model
In a common sequence-to-sequence architecture, a pre-trained BERT model serves as the encoder. The source text, wrapped in special tokens such as [CLS] and [SEP], is converted into embeddings and fed into the BERT encoder. The resulting contextualized representations are then passed to a separate decoder, optionally through an adapter layer that aligns the encoder's output with the decoder's input space. The decoder attends to these representations and autoregressively generates the target text, starting from a special token such as <s> and feeding each previously generated token back in as input for the next step.
The overall data flow can be visualized as follows:
Encoder side:
    Source Text:   [CLS] x1 ... xm [SEP]
        ↓
    Embeddings:    ex_cls ex_1 ... ex_m ex_m+1
        ↓
    BERT (Encoder)
        ↓
    Adapter  →  encoder representations consumed by the Decoder

Decoder side:
    Decoder Input: <s> y1 y2 ... yn−1
        ↓
    Embeddings:    ey_0 ey_1 ey_2 ... ey_n−1
        ↓
    Decoder  (attends to the Adapter output)
        ↓
    Target Text:   y1 y2 y3 ... yn
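As a concrete illustration, here is a minimal PyTorch sketch of this pipeline. It is a simplified stand-in rather than a reference implementation: the names (BertLikeSeq2Seq, greedy_decode), the layer sizes, and the use of a plain TransformerEncoder in place of an actual pre-trained BERT checkpoint are assumptions made for brevity.

```python
# Minimal sketch: BERT-style encoder + adapter + autoregressive Transformer decoder.
# All names and dimensions are illustrative; a real system would load pre-trained
# BERT weights into the encoder. Positional encodings are omitted for brevity.
import torch
import torch.nn as nn

class BertLikeSeq2Seq(nn.Module):
    def __init__(self, vocab_size, d_enc=768, d_dec=512, n_heads=8, n_layers=6):
        super().__init__()
        # Encoder: stands in for a pre-trained bidirectional model (BERT).
        self.src_embed = nn.Embedding(vocab_size, d_enc)
        enc_layer = nn.TransformerEncoderLayer(d_enc, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Adapter: maps encoder outputs into the decoder's input space.
        self.adapter = nn.Linear(d_enc, d_dec)
        # Decoder: randomly initialized, attends to the adapted encoder states.
        self.tgt_embed = nn.Embedding(vocab_size, d_dec)
        dec_layer = nn.TransformerDecoderLayer(d_dec, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.lm_head = nn.Linear(d_dec, vocab_size)

    def encode(self, src_ids):
        # src_ids: [CLS] x1 ... xm [SEP], shape (batch, m+2)
        memory = self.encoder(self.src_embed(src_ids))
        return self.adapter(memory)            # (batch, m+2, d_dec)

    def decode_step(self, tgt_ids, memory):
        # tgt_ids: <s> y1 ... y_{t-1}; a causal mask keeps decoding autoregressive
        t = tgt_ids.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(t)
        h = self.decoder(self.tgt_embed(tgt_ids), memory, tgt_mask=causal)
        return self.lm_head(h[:, -1])           # logits for the next token

@torch.no_grad()
def greedy_decode(model, src_ids, bos_id, eos_id, max_len=50):
    memory = model.encode(src_ids)
    tgt = torch.full((src_ids.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        next_id = model.decode_step(tgt, memory).argmax(-1, keepdim=True)
        tgt = torch.cat([tgt, next_id], dim=1)  # feed y_t back in for step t+1
        if (next_id == eos_id).all():
            break
    return tgt[:, 1:]                           # y1 ... yn (without <s>)
```

In practice the encoder weights would be loaded from a pre-trained BERT model and fine-tuned (or frozen), the adapter might be a small feed-forward network rather than a single linear layer, and generation would typically use beam search instead of the greedy loop shown here.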
