Essay

Choosing a Pre-training Objective Under Data Constraints and Deployment Needs

You lead an internal team pre-training a language model for two production uses: (1) a contract-review assistant that must accurately fill in missing clauses inside long documents (insertion and span infilling), and (2) a customer-support agent that must generate responses token-by-token with low latency. Your training data is a mix of clean policy documents and noisy OCR’d PDFs where words are frequently dropped, duplicated, or out of order. Your compute budget allows only ONE primary pre-training objective (you may add at most one lightweight auxiliary loss).
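
For concreteness, the OCR noise pattern can be simulated. The sketch below is a minimal, hypothetical illustration in Python (the function name `ocr_noise` and the rate parameters are invented for this prompt, not taken from any library); a denoising objective would train the model to reconstruct the clean sequence from input corrupted in exactly this way.

```python
import random

def ocr_noise(words, p_drop=0.10, p_dup=0.05, p_swap=0.05):
    """Simulate the OCR corruption described above: words are dropped,
    duplicated, or locally reordered at the given (hypothetical) rates."""
    out = []
    for w in words:
        r = random.random()
        if r < p_drop:
            continue                 # word dropped entirely
        out.append(w)
        if r > 1.0 - p_dup:
            out.append(w)            # word duplicated
    i = 0
    while i < len(out) - 1:
        if random.random() < p_swap:
            out[i], out[i + 1] = out[i + 1], out[i]  # adjacent words swapped
            i += 2                   # skip past the swapped pair
        else:
            i += 1
    return out

# Example: ocr_noise("coverage begins on the effective date".split())
# might drop "the", duplicate "on", or swap "effective" and "date".
```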

Write a recommendation memo that:

  • Selects the primary objective from: masked language modeling, causal language modeling, denoising autoencoder reconstruction, or permuted language modeling.
  • Justifies the choice by explicitly analyzing how the objective’s information flow (bidirectional vs left-to-right vs permuted; see the sketch below) interacts with (a) the OCR noise pattern and (b) the two deployment requirements (infilling vs streaming generation).
  • Proposes one auxiliary objective (either next sentence prediction or none) and argues for/against it, including at least one concrete risk of relying on superficial cues.

Your answer should make the tradeoffs clear (what you gain and what you give up) and explain why your chosen combination is the best fit for this scenario.
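
To ground the information-flow analysis asked for above, here is a minimal PyTorch-style sketch of what each position may attend to under the candidate objectives. The function names are illustrative, and the permuted variant is a simplification of the XLNet-style formulation rather than a full two-stream implementation.

```python
import torch

def causal_attention_mask(n):
    # Causal LM: position i attends only to positions <= i, which is what
    # permits cached, token-by-token streaming generation at inference.
    return torch.tril(torch.ones(n, n, dtype=torch.bool))

def bidirectional_attention_mask(n):
    # MLM / denoising reconstruction: every position sees the full context,
    # which is what makes in-place span infilling natural.
    return torch.ones(n, n, dtype=torch.bool)

def permuted_attention_mask(n, perm):
    # Permuted LM (simplified): the token at position perm[i] attends to the
    # tokens that precede it in the sampled factorization order perm.
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        for j in range(i + 1):
            mask[perm[i], perm[j]] = True
    return mask

# Example: a random factorization order over 4 positions.
# print(permuted_attention_mask(4, torch.randperm(4)))
```

The tension your memo must resolve is visible here: only the causal mask supports low-latency streaming decoding for the support agent; the bidirectional mask is what makes clause infilling inside a long contract natural; and permuted LM recovers bidirectional context in expectation while keeping an autoregressive factorization, at the cost of a more complex training setup.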
