Architecture of a BERT-based Encoder-Decoder Model
In a common sequence-to-sequence architecture, a pre-trained BERT model serves as the encoder. The source text, wrapped in special tokens such as [CLS] and [SEP], is converted into embeddings and fed into the BERT encoder. The resulting contextualized representations are then passed to a separate decoder, optionally through an adapter layer that aligns the encoder's output with the decoder's input space. The decoder attends to these representations and autoregressively generates the target text, starting from a special token such as <s> and feeding each previously generated token back in as input for the next step.
The overall data flow can be visualized as follows:
Encoder side:
    Source Text:   [CLS] x1 ... xm [SEP]
        ↓
    Embeddings:    ex_cls ex_1 ... ex_m ex_m+1
        ↓
    BERT (Encoder)
        ↓
    Adapter  →  encoder representations consumed by the Decoder

Decoder side:
    Decoder Input: <s> y1 y2 ... yn−1
        ↓
    Embeddings:    ey_0 ey_1 ey_2 ... ey_n−1
        ↓
    Decoder  (attends to the Adapter output)
        ↓
    Target Text:   y1 y2 y3 ... yn
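As a concrete illustration, here is a minimal PyTorch sketch of this pipeline. It is a simplified stand-in rather than a reference implementation: the names (BertLikeSeq2Seq, greedy_decode), the layer sizes, and the use of a plain TransformerEncoder in place of an actual pre-trained BERT checkpoint are assumptions made for brevity.

```python
# Minimal sketch: BERT-style encoder + adapter + autoregressive Transformer decoder.
# All names and dimensions are illustrative; a real system would load pre-trained
# BERT weights into the encoder. Positional encodings are omitted for brevity.
import torch
import torch.nn as nn

class BertLikeSeq2Seq(nn.Module):
    def __init__(self, vocab_size, d_enc=768, d_dec=512, n_heads=8, n_layers=6):
        super().__init__()
        # Encoder: stands in for a pre-trained bidirectional model (BERT).
        self.src_embed = nn.Embedding(vocab_size, d_enc)
        enc_layer = nn.TransformerEncoderLayer(d_enc, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Adapter: maps encoder outputs into the decoder's input space.
        self.adapter = nn.Linear(d_enc, d_dec)
        # Decoder: randomly initialized, attends to the adapted encoder states.
        self.tgt_embed = nn.Embedding(vocab_size, d_dec)
        dec_layer = nn.TransformerDecoderLayer(d_dec, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.lm_head = nn.Linear(d_dec, vocab_size)

    def encode(self, src_ids):
        # src_ids: [CLS] x1 ... xm [SEP], shape (batch, m+2)
        memory = self.encoder(self.src_embed(src_ids))
        return self.adapter(memory)            # (batch, m+2, d_dec)

    def decode_step(self, tgt_ids, memory):
        # tgt_ids: <s> y1 ... y_{t-1}; a causal mask keeps decoding autoregressive
        t = tgt_ids.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(t)
        h = self.decoder(self.tgt_embed(tgt_ids), memory, tgt_mask=causal)
        return self.lm_head(h[:, -1])           # logits for the next token

@torch.no_grad()
def greedy_decode(model, src_ids, bos_id, eos_id, max_len=50):
    memory = model.encode(src_ids)
    tgt = torch.full((src_ids.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        next_id = model.decode_step(tgt, memory).argmax(-1, keepdim=True)
        tgt = torch.cat([tgt, next_id], dim=1)  # feed y_t back in for step t+1
        if (next_id == eos_id).all():
            break
    return tgt[:, 1:]                           # y1 ... yn (without <s>)
```

In practice the encoder weights would be loaded from a pre-trained BERT model and fine-tuned (or frozen), the adapter might be a small feed-forward network rather than a single linear layer, and generation would typically use beam search instead of the greedy loop shown here.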
