Span-Based Denoising as an Encoder-Decoder Training Objective
In a span-based denoising task for an encoder-decoder model, the training objective is to reconstruct only the text that was masked out, not the entire input. The encoder processes an input sequence in which one or more spans have been replaced by unique mask or sentinel tokens. The decoder is then trained to generate a sequence of these sentinel tokens, each paired with the original text it replaced, effectively learning to 'fill in the blanks'. The loss is computed by comparing the decoder's output with the ground-truth masked spans.
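The construction described above can be sketched in a few lines. This is a minimal illustration, not the exact T5 preprocessing code: the sentinel names `<extra_id_0>`, `<extra_id_1>`, ... follow T5's convention, but the helper function itself is hypothetical and operates on whitespace tokens rather than subwords.

```python
def make_denoising_example(tokens, spans):
    """Replace each (start, end) token span with a sentinel token and
    build the decoder target as sentinel + original-span pairs.

    tokens: list of token strings
    spans:  list of non-overlapping (start, end) index pairs, in order
    """
    encoder_input, target = [], []
    prev_end = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        # Corrupted input: text up to the span, then the sentinel.
        encoder_input += tokens[prev_end:start] + [sentinel]
        # Target: the sentinel followed by the text it replaced.
        target += [sentinel] + tokens[start:end]
        prev_end = end
    encoder_input += tokens[prev_end:]
    # A final sentinel marks the end of the target sequence.
    target += [f"<extra_id_{len(spans)}>"]
    return encoder_input, target

tokens = "To learn about the solar system we first study the Sun".split()
inp, tgt = make_denoising_example(tokens, [(9, 11)])  # mask "the Sun"
# inp → [... "study", "<extra_id_0>"]
# tgt → ["<extra_id_0>", "the", "Sun", "<extra_id_1>"]
```

Note that the decoder target contains only the sentinels and the masked spans, which is what makes this objective cheaper per example than reconstructing the full original sentence.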
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Example of Denoising Task with Consecutive Token Masking
Span-Based Denoising as an Encoder-Decoder Training Objective
Input Corruption Methods for Denoising Autoencoder Training
Denoising Autoencoder Training Objective
Loss Calculation for Encoder-Decoder Denoising Tasks
Training Efficiency in Denoising Autoencoding
Flexibility of Masked Language Modeling for Encoder-Decoder Training
Example of a Denoising Autoencoder Task for Encoder-Decoder Models
BART Model's Use of Diverse Input Corruption Methods
An encoder-decoder model is being trained using the following example:
- Input to Encoder: "The scientist carefully [MASK] the solution into the beaker."
- Target Output for Decoder: "The scientist carefully poured the solution into the beaker."
Based on this training setup, what is the primary function of the decoder?
Evaluating a Model Training Objective
An encoder-decoder model is being trained with the objective of reconstructing a full, original sentence from an input version where several random words have been removed. What is the most critical function of the encoder's output in this specific training paradigm?
Corrupted Input for Encoder-Decoder Pre-training
Diagrammatic Example of an Encoder-Decoder Model Trained with a Denoising Autoencoding Objective
Learn After
An encoder-decoder model is being trained with a span-based denoising objective. The encoder is given the following corrupted input text: 'To learn about the solar system, we first study <mask_0> and then move on to <mask_1> planets.' The original, uncorrupted text for the masked spans is '<mask_0>' = 'the Sun' and '<mask_1>' = 'the other'. What should the target output sequence for the decoder be in this training step?
Analysis of Denoising Training Objectives
Debugging a Span-Based Denoising Training Pipeline
Your team is pretraining an internal T5-style enco...
Your company wants one internal model to support m...
Your team is pretraining an internal T5-style mode...
Your team is building a single internal T5-style t...
Diagnosing a T5-Style Model That Ignores Task Prefixes After Span-Denoising Pretraining
Choosing Between Span-Denoising Pretraining and Task-Specific Fine-Tuning in a T5-Style Text-to-Text System
Designing a Unified Text-to-Text Model and Pretraining Objective for Multiple NLP Features
Root-Cause Analysis of a T5-Style Model Producing Fluent but Unfaithful Outputs
Selecting an Architecture and Pretraining Objective for a Unified Internal NLP Service
Post-Pretraining Data Formatting Bug in a T5-Style Text-to-Text Service