Debugging a Span-Based Denoising Training Pipeline
An engineer is training an encoder-decoder model with a span-based denoising objective. The corrupted input fed to the encoder is: 'The model <mask_A> to fill in the <mask_B> spans.' The engineer is unsure how to format the target sequence for the decoder and is weighing two options:
- Option 1: 'The model learns to fill in the missing text spans.'
- Option 2: '<mask_A> learns <mask_B> missing text'
Which option is the correct target for this specific training objective, and why is the other option incorrect or less efficient for this task?
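For concreteness, the input/target pairing the question describes can be sketched in a few lines of Python. This is a minimal illustration, not any library's actual preprocessing code: the helper name `make_denoising_pair` and the whitespace tokenization are assumptions for the example, and the sentinel names `<mask_A>` and `<mask_B>` follow the question's own notation.

```python
# Minimal sketch of building one (encoder_input, decoder_target) pair
# for T5-style span corruption. Helper name and tokenization are
# illustrative assumptions; sentinels follow the question's notation.

def make_denoising_pair(tokens, spans, sentinels):
    """Build (encoder_input, decoder_target) for span corruption.

    tokens    : list of word tokens from the original text
    spans     : non-overlapping (start, end) index pairs to mask,
                given in left-to-right order
    sentinels : one sentinel token per masked span
    """
    encoder_input, target = [], []
    prev_end = 0
    for (start, end), sentinel in zip(spans, sentinels):
        encoder_input.extend(tokens[prev_end:start])  # keep unmasked text
        encoder_input.append(sentinel)                # replace span with its sentinel
        target.append(sentinel)                       # target: sentinel ...
        target.extend(tokens[start:end])              # ... then the dropped span
        prev_end = end
    encoder_input.extend(tokens[prev_end:])           # trailing unmasked text
    return " ".join(encoder_input), " ".join(target)


text = "The model learns to fill in the missing text spans.".split()
inp, tgt = make_denoising_pair(text, [(2, 3), (7, 9)], ["<mask_A>", "<mask_B>"])
print(inp)  # The model <mask_A> to fill in the <mask_B> spans.
print(tgt)  # <mask_A> learns <mask_B> missing text
```

Running the sketch reproduces the encoder input quoted in the question, and the printed target shows the compact sentinel-plus-span format that a span-corruption objective trains the decoder to emit.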