Example

Architecture of a BERT-based Encoder-Decoder Model

In a common sequence-to-sequence architecture, a pre-trained BERT model serves as the encoder. The source text, formatted with special tokens such as [CLS] and [SEP], is converted into embeddings and fed into the BERT encoder. The resulting contextualized representations are then passed to a separate decoder, optionally through an adapter layer that aligns the encoder's output with the decoder's input space. The decoder then generates the target text autoregressively, starting from a special token such as <s> and feeding each previously generated token back in as input for the next step.

The overall data flow can be visualized as follows:

Source Text:    [CLS] x1 ... xm [SEP]
                      ↓
Embeddings:     ex_cls ex_1 ... ex_m ex_m+1
                      ↓
                BERT (Encoder)
                      ↓
                   Adapter
                      ↓
Decoder Input:  <s> y1 y2 ... yn−1
                      ↓
Embeddings:     ey_0 ey_1 ey_2 ... ey_n−1
                      ↓
                   Decoder
                      ↓
Target Text:    y1 y2 y3 ... yn
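To make this data flow concrete, here is a minimal sketch in PyTorch, assuming the Hugging Face transformers library is available. The adapter width, decoder depth, and target vocabulary size are illustrative assumptions, not a prescribed configuration; only the overall pattern (pre-trained BERT encoder → linear adapter → autoregressive decoder) follows the description above.

```python
# Minimal sketch: pre-trained BERT encoder + adapter + Transformer decoder.
# Sizes (d_model=512, 6 decoder layers, vocab size) are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertEncoderDecoder(nn.Module):
    def __init__(self, decoder_vocab_size, d_model=512, num_decoder_layers=6):
        super().__init__()
        # Pre-trained BERT serves as the encoder (hidden size 768 for bert-base).
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        # Adapter: projects encoder outputs into the decoder's input space.
        self.adapter = nn.Linear(self.encoder.config.hidden_size, d_model)
        # Target-side embeddings for <s>, y1, ..., y_{n-1}.
        self.tgt_embed = nn.Embedding(decoder_vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_decoder_layers)
        self.out_proj = nn.Linear(d_model, decoder_vocab_size)

    def forward(self, src_ids, src_mask, tgt_ids):
        # Encode [CLS] x1 ... xm [SEP] with BERT, then adapt the representations.
        memory = self.encoder(input_ids=src_ids, attention_mask=src_mask).last_hidden_state
        memory = self.adapter(memory)
        # Causal mask so position t only attends to <s>, y1, ..., y_{t-1}.
        T = tgt_ids.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(tgt_ids.device)
        hidden = self.decoder(
            self.tgt_embed(tgt_ids), memory,
            tgt_mask=causal,
            memory_key_padding_mask=(src_mask == 0),  # ignore padded source positions
        )
        return self.out_proj(hidden)  # logits over the target vocabulary

# Usage (shapes only): encode a tokenized source and score a shifted target.
tok = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tok(["a short source sentence"], return_tensors="pt")
model = BertEncoderDecoder(decoder_vocab_size=32000)  # vocab size is an assumption
tgt = torch.tensor([[1, 17, 42]])                     # <s> y1 y2 (illustrative ids)
logits = model(batch["input_ids"], batch["attention_mask"], tgt)  # (1, 3, 32000)
```

At generation time the decoder would run step by step: feed <s>, pick the next token from the logits, append it to the decoder input, and repeat until an end-of-sequence token appears. Hugging Face's EncoderDecoderModel offers a ready-made version of this warm-started encoder-decoder pattern.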