- Simple RNNs, LSTMs, GRUs, convolutional networks, and transformer networks can be employed as encoders.
- Stacked Bi-LSTMs are widely used.

Encoder

- For the decoder, autoregressive generation is used to output a sequence, one element at a time, until an end-of-sentence marker appears.
- Typically, the decoder is an LSTM- or GRU-based RNN. The context vector is the final hidden state of the encoder and is used to initialize the first hidden state of the decoder.
- To keep the influence of the context vector from fading as decoding proceeds, the context vector can be added as an input to the computation of every decoder hidden state, not only the first.
- To keep track of what has already been generated and what hasn’t, the output is conditioned on three things: the newly generated hidden state, the output generated at the previous step, and the encoder context.
- Beam search is used to improve the output. Choosing the argmax independently at each step can produce an unreliable sequence, so several candidate sequences are kept and compared instead.
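The difference between step-by-step argmax and beam search can be sketched as follows. This is a minimal illustration, not a trained decoder: `TOY_MODEL`, `next_token_probs`, and the probabilities in it are hypothetical and chosen only to make the effect visible.

```python
import math

EOS = "</s>"

# Hypothetical toy "decoder": next-token probabilities given the prefix so far.
# The numbers are invented for illustration.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}

def next_token_probs(prefix):
    # Any prefix not listed above ends the sequence with certainty.
    return TOY_MODEL.get(tuple(prefix), {EOS: 1.0})

def greedy_decode(max_len=5):
    """Pick the single most probable token at every step."""
    seq, logp = [], 0.0
    while len(seq) < max_len:
        probs = next_token_probs(seq)
        tok = max(probs, key=probs.get)
        logp += math.log(probs[tok])
        seq.append(tok)
        if tok == EOS:
            break
    return seq, logp

def beam_search(beam_width=2, max_len=5):
    """Keep the beam_width best partial sequences at every step."""
    beams = [([], 0.0)]   # unfinished hypotheses: (tokens, log-probability)
    finished = []         # hypotheses that have produced EOS
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            for tok, p in next_token_probs(seq).items():
                candidates.append((seq + [tok], logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, logp in candidates[:beam_width]:
            if seq[-1] == EOS:
                finished.append((seq, logp))
            else:
                beams.append((seq, logp))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])
```

In this toy model, greedy decoding commits to `a` at the first step and ends with a sequence of probability 0.3, while a beam of width 2 keeps `b` alive and finds `b x`, which has probability 0.36.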

Decoder

The number of hidden states generated from the encoding process varies with the size of the input, making it difficult to use them directly as a context for the decoder.
- Solution 1: basic RNN-based architecture
    - Advantage: simple; reduces the context to a fixed-length vector.
    - Drawback: the final hidden state is more focused on the latter parts of the input sequence.
- Solution 2: Bi-RNNs
    - Advantage: focuses on the input as a whole, rather than only the latter parts.
    - Drawback: loses information about each of the individual encoder states that might be useful in decoding.
- Solution 3: attention mechanism
    - Advantages: considers the whole encoder context; dynamically updates during decoding; can be embodied in a fixed-size vector.
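To make the attention option concrete, here is a minimal sketch of how a fixed-size context vector can be built from a variable number of encoder states. It assumes dot-product scoring between the current decoder state and each encoder state, which is only one possible scoring function; the function names are illustrative.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_context(decoder_state, encoder_states):
    """Return (context, weights) for one decoding step.

    The context is a weighted sum of all encoder states, so its size
    depends on the hidden dimension, not on the input length.
    """
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return context, weights
```

Because `decoder_state` changes at every decoding step, the weights and therefore the context are recomputed each time, which is the "dynamically updates during decoding" property listed above.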

Context vector

The encoder-decoder architecture can also be implemented using transformers, consisting of:
- An encoder that takes the source language input words $$X = x_1, ..., x_T$$ and maps them to an output representation $$H^{enc} = h_1, ..., h_T$$; usually via $$N = 6$$ stacked encoder blocks.
- A decoder which is similar to the one within the encoder-decoder RNN. However, the decoder transformer block includes an extra cross-attention layer in order to attend to the source language.

Encoder-Decoder with Transformers

To achieve effectiveness in multi-lingual and cross-lingual applications like machine translation, pre-trained encoder-decoder models require training on multi-lingual data. This process necessitates a vocabulary containing tokens from all involved languages, which allows the model to learn shared representations and develop capabilities for both understanding and generation across different languages.

Multi-lingual Pre-training for Encoder-Decoder Models

An encoder-decoder architecture functions by mapping an input sequence, denoted as $$\mathbf{x}$$, to a corresponding output sequence, $$\mathbf{y}$$. This end-to-end transformation is mathematically expressed as $$\mathbf{y} = \mathrm{Model}_{\theta, \omega}(\mathbf{x})$$, which emphasizes that the model relies on two separate sets of parameters: $$\theta$$ for the encoder and $$\omega$$ for the decoder. When broken down into its two primary operations, the formula becomes $$\mathbf{y} = \mathrm{Decode}_{\omega}(\mathrm{Encode}_{\theta}(\mathbf{x}))$$. This detailed expression illustrates that the encoder function, utilizing parameters $$\theta$$, first processes the input sequence $$\mathbf{x}$$ to build an internal representation. Subsequently, the decoder function, governed by parameters $$\omega$$, uses this representation to construct the final output sequence $$\mathbf{y}$$.
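The composition of the two functions can be illustrated with a deliberately trivial sketch. Here lookup tables stand in for learned networks: `theta` and `omega` are hypothetical parameter sets, and real encoders and decoders are neural networks rather than dictionaries.

```python
def encode(x, theta):
    # theta: encoder parameters (here, a source-token -> id table).
    return [theta[token] for token in x]

def decode(h, omega):
    # omega: decoder parameters (here, an id -> target-token table).
    return [omega[i] for i in h]

def model(x, theta, omega):
    # y = Decode_omega(Encode_theta(x))
    return decode(encode(x, theta), omega)

theta = {"the": 0, "cat": 1}
omega = {0: "le", 1: "chat"}
y = model(["the", "cat"], theta, omega)
```

The point of the sketch is only the structure: the decoder never sees `x` directly, only the representation the encoder produced, and the two stages have separate parameters.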

Mathematical Formulation of an Encoder-Decoder Model

Sequence-to-sequence (seq2seq) models, which utilize both an encoder and a decoder, are a standard framework for text generation tasks. This approach is suitable for applications like machine translation, summarization, question answering, and dialogue generation, where a source text is mapped to a target text. Models like T5 and mBART are prominent examples of pre-trained seq2seq models. This framework is versatile, allowing both Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks to be addressed and fine-tuned within the same architecture.

Seq2seq Models for Text Generation

Auto-regressive decoding is a process used in tasks like machine translation where each token in the target-language output sequence is generated sequentially. The generation of each new token is conditioned on two sources of information: the tokens that have already been generated in the target sequence and the complete source-language input sequence.
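This conditioning corresponds to the usual chain-rule factorization of the output probability. As a sketch, writing $$\mathbf{x}$$ for the source sequence, $$\mathbf{y} = y_1, ..., y_m$$ for the target sequence, and $$m$$ for the target length (notation assumed here, not taken from the text above):

$$P(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{m} P(y_t \mid y_1, ..., y_{t-1}, \mathbf{x})$$

Each factor is one decoding step: the model predicts $$y_t$$ from the previously generated tokens and the full source sequence.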

Auto-Regressive Decoding in Machine Translation

Encoder-decoder architectures are highly versatile for NLP tasks. Beyond standard sequence-to-sequence problems, their application can be generalized by considering text as both the input and output of a problem. This simple idea allows encoder-decoder models to be directly applied to a wide array of NLP challenges. For instance, sentiment analysis can be framed as a text-to-text task where the model takes a text as input and generates an output text describing the sentiment, such as 'positive', 'negative', or 'neutral'.
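A minimal sketch of how tasks can be cast into this text-to-text form is shown below. The prefix strings and the helper name `to_text_to_text` are hypothetical choices for illustration; the actual wording of prefixes is a design decision.

```python
def to_text_to_text(task_prefix, text, target):
    """Format one example as an (input string, target string) pair."""
    return {"input": f"{task_prefix}: {text}", "target": target}

examples = [
    # Classification becomes generation of a label word.
    to_text_to_text("sentiment", "I loved this film.", "positive"),
    # Translation uses the same interface with a different prefix.
    to_text_to_text("translate English to German", "Good morning.", "Guten Morgen."),
]
```

Both examples have the same shape, so a single encoder-decoder model can be trained on them with the same loss; only the prefix tells the model which task is being asked for.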

Applying Encoder-Decoder Architectures to NLP via the Text-to-Text Framework

A sequence-to-sequence model is designed to translate English sentences into French. When given the English input, 'The quick brown fox jumps over the lazy dog,' the model produces the French output, 'Où est la bibliothèque?' ('Where is the library?'). The generated French sentence is grammatically perfect and fluent, but it is completely unrelated to the meaning of the English input. Based on this specific failure, which component of the underlying architecture is most likely the primary source of the error?

A team has built a model to summarize long news articles. The model's architecture consists of two main components: a processing component that reads the entire source article and compresses it into a single, fixed-size numerical representation (a context vector), and a generation component that uses only this single vector to write the summary. During testing, the team observes a consistent problem: the generated summaries are fluent and grammatically correct, but they only seem to reflect information from the *end* of the article, ignoring key points from the beginning and middle. Based on the described flow of information, what is the most likely reason for this specific failure?

Diagnosing an Architectural Flaw in a Summarization Model

Arrange the following events to accurately describe the flow of information in a standard encoder-decoder architecture for a sequence-to-sequence task.

Your team is pretraining an internal T5-style enco...

Your company wants one internal model to support m...

Your team is pretraining an internal T5-style mode...

Your team is building a single internal T5-style t...

Your team pretrains an encoder–decoder model using span-based denoising: the encoder receives inputs where contiguous spans are replaced by unique sentinel tokens (e.g., <extra_id_0>, <extra_id_1>), and the decoder is trained to output the missing spans in order, each preceded by the corresponding sentinel token. After this pretraining, you fine-tune the same model in a text-to-text setup where every example is formatted as a single input string containing a task prefix plus content (e.g., "summarize: <document>", "translate English to German: <sentence>") and the target is the desired output text.

In production, you observe a consistent failure mode: regardless of the task prefix, the model often outputs fluent text that looks like it is "filling in missing pieces" of the input rather than performing the requested task (e.g., for "summarize:" it tends to restate or continue the document; for "translate:" it produces English paraphrases).

Write an analysis that (1) explains, using the roles of the encoder and decoder in an encoder–decoder network, why a span-denoising objective can bias the model toward this behavior when combined with a text-to-text task-prefix interface, and (2) proposes two concrete, implementable changes to the fine-tuning data formatting and/or training setup (not model size) that would make the task prefix more causally influential on the decoder’s outputs. For each proposed change, justify the expected mechanism of improvement and one tradeoff or risk it introduces.

Diagnosing a T5-Style Model That Ignores Task Prefixes After Span-Denoising Pretraining

You lead an applied NLP team building a single internal “language workbench” model for three enterprise workflows: (1) generate a 1–2 sentence customer-support ticket summary, (2) extract a JSON-like list of product names mentioned in a ticket, and (3) rewrite a ticket into a more polite tone. Leadership wants one model that can do all three by changing only a textual instruction prefix (e.g., “summarize: …”, “extract products: …”, “rewrite politely: …”).

You have budget for either (A) large-scale self-supervised pretraining of an encoder–decoder model using span-based denoising (mask contiguous spans in the input with sentinel tokens and train the decoder to output the missing spans with those sentinels), followed by light fine-tuning on a small labeled set for each workflow, or (B) no denoising pretraining, but heavier supervised fine-tuning on larger labeled sets for each workflow.

Write an evaluation memo recommending A or B. Your memo must explicitly connect: (i) how the encoder–decoder architecture supports instruction-conditioned text-to-text behavior across these heterogeneous tasks, and (ii) how span-based denoising changes what the encoder and decoder learn (and why that matters for both generation tasks like summarization/rewriting and “structured” generation like product extraction). Include at least two concrete risks/tradeoffs of your chosen option (e.g., failure modes, data requirements, output controllability), and propose one mitigation for each risk.

Choosing Between Span-Denoising Pretraining and Task-Specific Fine-Tuning in a T5-Style Text-to-Text System

You lead an internal platform team that wants to standardize three product features on a single model: (1) customer-email summarization, (2) sentiment classification of the email (output should be a label like "positive"/"negative"), and (3) extracting the top 3 action items as short bullet phrases. You are considering a T5-style approach.

Write an essay that proposes (a) how you would represent all three features in a single text-to-text interface (i.e., what the input and output strings would look like, including how the model is instructed), and (b) how span-based denoising pretraining for an encoder–decoder network supports this unified approach.

In your answer, explicitly connect the encoder’s role, the decoder’s role, and the span-masking/sentinel-token target format to why the same architecture can handle both “generate long text” (summaries) and “generate short text” (labels/action items). Also discuss at least one practical tradeoff or failure mode you would anticipate if the text-to-text prompts or the denoising objective are poorly aligned with the downstream tasks.

Designing a Unified Text-to-Text Model and Pretraining Objective for Multiple NLP Features

You are rolling out a single internal NLP service based on an encoder–decoder model in a T5-style text-to-text setup. Every request is formatted as plain text with an instruction prefix (e.g., "summarize:", "translate en->de:", "extract entities:") followed by the user content, and the model always generates a text output.

After pretraining, the team reports a consistent failure mode across multiple downstream tasks: outputs are fluent and on-topic for the *instruction*, but they often ignore key facts from the provided input. For example:
- Input: "summarize: The incident report states the outage lasted 17 minutes and affected only EU customers." Output: "A brief outage impacted customers for about an hour across multiple regions."
- Input: "extract entities: Contract signed by Acme Corp on 2024-01-12 for $2.3M." Output: "Acme Corp; 2023-12-01; $3.0M"

You inspect the pretraining pipeline and find it uses span-based denoising with sentinel tokens, but the data engineer implemented the decoder target as the *entire original uncorrupted text* (i.e., the decoder is trained to reproduce the full input sequence), rather than the standard T5-style target that concatenates only the missing spans with their sentinel tokens.

As the model owner, analyze how this specific pretraining-target mistake would change what the encoder and decoder learn in an encoder–decoder network, and explain why that would plausibly lead to the observed "instruction-following but input-unfaithful" behavior in a text-to-text system. Provide one concrete correction to the pretraining objective/format that would directly address the issue.

Root-Cause Analysis of a T5-Style Model Producing Fluent but Unfaithful Outputs

You are designing a single internal NLP service for a regulated enterprise that must support multiple text-in/text-out features behind one API: (1) "summarize" long incident reports into 3 bullet points, (2) "extract" a comma-separated list of product names mentioned in a customer email, and (3) "rewrite" a draft response to be more formal. The platform team wants one model that can switch behavior based on a textual instruction prefix (e.g., "summarize:", "extract:", "rewrite:") and can be pre-trained on a large corpus of unlabeled internal documents before any task-specific fine-tuning.

A prototype team proposes using an encoder-only model with a classification head for extraction and a separate decoder-only model for summarization/rewriting, arguing it will be simpler. Another team proposes a single encoder-decoder model trained in a T5-style text-to-text framework, using span-based denoising pretraining (masking contiguous spans in the input with sentinel tokens and training the decoder to output the missing spans) before fine-tuning on the three tasks.

As the technical reviewer, which proposal would you approve and why? In your answer, explicitly connect (a) how the text-to-text instruction prefix interacts with the chosen architecture, and (b) how span-based denoising pretraining prepares (or fails to prepare) the model for all three downstream behaviors within one system.

Selecting an Architecture and Pretraining Objective for a Unified Internal NLP Service

You are rolling out an internal, single-model NLP service based on a T5-style text-to-text approach. The model is an encoder–decoder network that was pre-trained with span-based denoising using sentinel tokens (e.g., <extra_id_0>, <extra_id_1>) and then fine-tuned on multiple tasks using textual task prefixes.

After deployment, two issues appear:
1) For classification-style requests (e.g., "sentiment: <text>"), the model often outputs strings that look like "<extra_id_0> positive" or "<extra_id_0> negative" instead of just "positive"/"negative".
2) For generation-style requests (e.g., "summarize: <article>"), the model sometimes inserts sentinel tokens into the summary.

A teammate proposes a quick fix: "Strip any <extra_id_*> tokens from the model output at inference time and ship." Another teammate argues the root cause is in how inputs/targets are being constructed for fine-tuning and that the fix should be in the text-to-text formatting and training pipeline.

As the reviewer, analyze which teammate is more correct and justify your decision by explaining (a) how span-based denoising trains an encoder–decoder model to use sentinel tokens, (b) how the text-to-text task prefix + target formatting should differ between denoising pretraining and downstream fine-tuning, and (c) one concrete change you would make to the fine-tuning data (source/target strings) to prevent sentinel-token leakage without relying on post-processing.

Post-Pretraining Data Formatting Bug in a T5-Style Text-to-Text Service

A second method for pre-training encoder-decoder models involves masked language modeling. In this technique, specific tokens within an input sequence are randomly substituted with a mask symbol. The model is subsequently trained to predict the identities of these masked tokens by analyzing the entirety of the masked sequence.
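The corruption step can be sketched as follows. This is an illustrative sketch only: the mask symbol `[MASK]` and the masking rate of 0.15 are assumptions, not values stated above.

```python
import random

MASK = "[MASK]"  # assumed mask symbol

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with MASK.

    Returns the masked sequence and a dict mapping each masked
    position to the original token the model should predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets
```

During training the model receives the whole `masked` sequence as input and is scored only on how well it predicts the tokens stored in `targets`.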

Pre-training Encoder-Decoder Models via Masked Language Modeling

The architecture obtained by abstracting this modified form of autoregressive generation (generation that continues from a given prefix) is referred to as the encoder-decoder architecture.

It consists of three parts:
- encoder: accepts a sequence as its input and generates a corresponding sequence of contextualized representations (hidden states);
- context vector: a function of the contextualized representations generated by the encoder, which conveys the essence of the input to the decoder;
- decoder: takes the context vector as input and generates an arbitrary-length sequence of hidden states, from which a corresponding sequence of output states is obtained.

The encoder and decoder networks are typically implemented with the same architecture, often using recurrent networks, but there are some other possibilities in each part.

University of Michigan - Ann Arbor

Google

Instead of generating a sentence from scratch, the language model can complete a sequence given a specified prefix, as follows.

1. Pass a specified prefix to the language model using forward inference to produce a sequence of hidden states;
2. Apply autoregressive generation using the hidden state of the last word of the prefix as the starting point of generation;
3. The result of this process is a sequence of words that should be a reasonable completion given the prefix input.
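The three steps above can be sketched with a toy language model. Everything here is a simplifying assumption: `BIGRAMS` is a hypothetical deterministic next-word table, and the "hidden state" at each position is simply the word itself, whereas a real RNN would carry a learned vector.

```python
EOS = "</s>"

# Hypothetical toy language model: each word has one fixed successor.
BIGRAMS = {
    "<s>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "down",
    "down": EOS,
}

def forward(prefix):
    # Step 1: forward inference over the prefix, one state per word.
    # In this toy model the state is just the word itself.
    return list(prefix)

def complete(prefix, max_new_tokens=10):
    states = forward(prefix)
    state = states[-1]          # Step 2: start from the last prefix state.
    completion = []
    for _ in range(max_new_tokens):
        token = BIGRAMS.get(state, EOS)
        if token == EOS:
            break
        completion.append(token)
        state = token           # The generated word becomes the next input.
    return completion           # Step 3: the completion of the prefix.
```

For example, `complete(["<s>", "the", "cat"])` continues from the state of `cat` rather than starting a new sentence.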

A variation on autoregressive generation

A helpful NLP textbook that is still a work in progress:
https://web.stanford.edu/~jurafsky/slp3/

Speech and Language Processing (3rd ed. draft) 

Reference of Foundations of Large Language Models Course

The data used to train the model are known as parallel texts, or bitexts. Each training example is formed as
bitext = source + </s> + target
where source is the text being translated, target is the translation output, and </s> is the end-of-sentence token.
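A minimal sketch of building one such training sequence from a sentence pair, assuming whitespace tokenization (the helper name `make_bitext` is hypothetical):

```python
EOS = "</s>"

def make_bitext(source, target):
    """Concatenate source and target tokens around the </s> token."""
    return source.split() + [EOS] + target.split()

pair = make_bitext("the green witch arrived", "llegó la bruja verde")
```

The position of `</s>` tells the model where the source ends and where generation of the target should begin.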
