Transitioning from Masked Language Modeling to Downstream Tasks

After completing the Masked Language Modeling pre-training phase, the model yields optimized parameters $\widehat{\mathbf{W}}$ (the parameters of the prediction head used for the masked-token task) and $\hat{\theta}$ (the core encoder parameters). To transition to downstream applications, the prediction head parameters $\widehat{\mathbf{W}}$ are dropped. The resulting pre-trained encoder, denoted $\mathrm{Encoder}_{\hat{\theta}}(\cdot)$, can then be either applied directly to downstream tasks or further fine-tuned on task-specific datasets.
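To make the handoff concrete, here is a minimal PyTorch sketch (all class and module names are hypothetical, not from the source): a toy MLM model whose prediction head plays the role of $\widehat{\mathbf{W}}$ is reduced to its encoder $\mathrm{Encoder}_{\hat{\theta}}(\cdot)$, which is then reused under a new task-specific head.

```python
# A minimal sketch of dropping the MLM head and reusing the encoder.
# All names here (MLMModel, Classifier, etc.) are illustrative assumptions.
import torch
import torch.nn as nn

class MLMModel(nn.Module):
    """Toy MLM model: encoder parameters ~ theta-hat, head parameters ~ W-hat."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Prediction head (W-hat): used only for the masked-token objective.
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))  # Encoder_{theta-hat}(.)
        return self.head(h)                      # MLM logits, pre-training only

pretrained = MLMModel()  # imagine this was already trained with the MLM objective

class Classifier(nn.Module):
    """Downstream model: keeps the pre-trained encoder, discards the MLM head."""
    def __init__(self, mlm: MLMModel, d_model=64, num_labels=2):
        super().__init__()
        self.embed, self.encoder = mlm.embed, mlm.encoder  # reuse theta-hat
        self.task_head = nn.Linear(d_model, num_labels)    # new, task-specific

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))
        return self.task_head(h[:, 0])  # classify from the first position

clf = Classifier(pretrained)
logits = clf(torch.randint(0, 1000, (2, 16)))  # use directly, or fine-tune
```

Note that `Classifier` shares the encoder modules rather than copying them, so subsequent fine-tuning updates $\hat{\theta}$ in place; freezing those parameters instead would correspond to applying the pre-trained encoder directly.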
