Essay

Diagnosing a T5-Style Model That Ignores Task Prefixes After Span-Denoising Pretraining

Your team pretrains an encoder–decoder model using span-based denoising: the encoder receives inputs where contiguous spans are replaced by unique sentinel tokens (e.g., <extra_id_0>, <extra_id_1>), and the decoder is trained to output the missing spans in order, each preceded by the corresponding sentinel token. After this pretraining, you fine-tune the same model in a text-to-text setup where every example is formatted as a single input string containing a task prefix plus content (e.g., "summarize: ", "translate English to German: ") and the target is the desired output text.
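For concreteness, here is a minimal sketch of the two data formats involved. The helper name and the simplified span-selection rule are illustrative assumptions; the actual T5 recipe samples span lengths and a corruption budget more carefully.

import random

# Sketch of span-based denoising over word-level "tokens", assuming a fixed
# span length and per-position corruption probability (hypothetical helper,
# simplified relative to the original T5 procedure).
def span_corrupt(tokens, corrupt_prob=0.15, span_len=3, seed=0):
    rng = random.Random(seed)
    encoder_input, decoder_target = [], []
    i, sentinel_id = 0, 0
    while i < len(tokens):
        if rng.random() < corrupt_prob:
            sentinel = f"<extra_id_{sentinel_id}>"          # unique sentinel
            encoder_input.append(sentinel)                  # replaces the span
            decoder_target.append(sentinel)                 # marks which span
            decoder_target.extend(tokens[i:i + span_len])   # the missing text
            i += span_len
            sentinel_id += 1
        else:
            encoder_input.append(tokens[i])
            i += 1
    return " ".join(encoder_input), " ".join(decoder_target)

text = "Thank you for inviting me to your party last week"
enc_in, dec_out = span_corrupt(text.split())
# enc_in  might be: "Thank you <extra_id_0> to your party <extra_id_1>"
# dec_out might be: "<extra_id_0> for inviting me <extra_id_1> last week"

# By contrast, the text-to-text fine-tuning interface packs the task into a
# plain string prefix, with the desired output as the target:
ft_input  = "translate English to German: That is good."
ft_target = "Das ist gut."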

In production, you observe a consistent failure mode: regardless of the task prefix, the model often outputs fluent text that looks like it is "filling in missing pieces" of the input rather than performing the requested task (e.g., for "summarize:" it tends to restate or continue the document; for "translate:" it produces English paraphrases).

Write an analysis that (1) explains, using the roles of the encoder and decoder in an encoder–decoder network, why a span-denoising objective can bias the model toward this behavior when combined with a text-to-text task-prefix interface, and (2) proposes two concrete, implementable changes to the fine-tuning data formatting and/or training setup (not model size) that would make the task prefix more causally influential on the decoder’s outputs. For each proposed change, justify the expected mechanism of improvement and one tradeoff or risk it introduces.

Updated 2026-02-06


Tags: Ch.1 Pre-training - Foundations of Large Language Models; Foundations of Large Language Models; Foundations of Large Language Models Course; Computing Sciences; Data Science
