Case Study

Root-Cause Analysis of Pre-training Objective Leakage and Coherence Failures

You lead an internal team pre-training an LLM for a corporate knowledge assistant that must (1) generate long, coherent incident summaries and (2) support sentence-pair tasks such as “Does this policy clause entail that control requirement?” During a red-team exercise, the model shows two issues: (a) when asked to fill a missing word in the middle of a sentence, it often produces plausible but inconsistent completions, and (b) when generating multi-paragraph summaries, it sometimes contradicts earlier sentences. Your training pipeline used an encoder-only Transformer with a masked-token objective plus a binary classifier trained to predict whether Sentence B follows Sentence A; you did not train any left-to-right next-token objective. A colleague proposes switching to an encoder-decoder denoising objective (reconstruct clean text from corrupted text) and adding a permuted prediction objective; another colleague proposes replacing everything with a pure causal (left-to-right) language modeling objective.

As the decision-maker, which proposal (or combination) would you choose to address BOTH issues with the least mismatch to the two product requirements, and why? In your answer, explicitly connect how the information available at training time (bidirectional masking vs sentence-pair classification vs left-to-right generation vs reconstruction-from-noise vs permuted prediction order) would causally affect (a) mid-sentence infilling quality and (b) long-form generation consistency.
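To make the differences in training-time information concrete, here is a minimal sketch (a hypothetical token-level example, not from the case study) of the input/target pairs each objective family would construct from the same sentence. Token strings, mask symbols, and the fixed permutation order are illustrative assumptions.

```python
# One example sentence, tokenized by whitespace for illustration.
tokens = ["The", "server", "restarted", "after", "the", "patch", "was", "applied"]

# Masked-token objective (encoder-only, BERT-style MLM):
# the model sees BOTH left and right context and predicts only the masked slot.
mlm_input = ["The", "server", "[MASK]", "after", "the", "patch", "was", "applied"]
mlm_target = {2: "restarted"}  # position -> original token

# Causal (left-to-right) LM: at each step only the left context is visible,
# and the model is trained to extend its own previous outputs.
causal_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. causal_pairs[1] == (["The", "server"], "restarted")
# -- no access to "after the patch was applied" when predicting "restarted".

# Denoising objective (encoder-decoder, T5/BART-style):
# encode a corrupted input, then DECODE the missing span left-to-right,
# so the model practices both bidirectional reading and sequential generation.
denoise_input = ["The", "server", "<X>", "the", "patch", "was", "applied"]
denoise_target = ["<X>", "restarted", "after"]

# Permuted prediction (XLNet-style): predict tokens in a random factorization
# order; each target conditions on whichever tokens came earlier in that order,
# which can include tokens from both sides of the target position.
perm_order = [3, 0, 6, 2, 5, 1, 4, 7]  # one fixed example order
# Predicting tokens[2] ("restarted") here conditions on positions 3, 0, 6:
perm_context_for_2 = [tokens[p] for p in perm_order[: perm_order.index(2)]]
```

The contrast the question asks about falls out directly: only the causal and denoising setups ever train a model to generate text sequentially (relevant to long-form consistency), while only the masked, denoising, and permuted setups expose right-hand context during training (relevant to mid-sentence infilling).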

Updated 2026-02-06
