Learn Before
Pre-training Strategy for a Multilingual Model
Analyze the two pre-training strategies presented in the case study. Which strategy is more likely to result in a model with superior cross-lingual capabilities? Justify your choice by explaining the fundamental difference in what the model learns from each approach.
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Bilingual Sentence Packing for Pre-training
Pre-training Strategy for a Multilingual Model
A researcher is pre-training a multilingual model using a masked language modeling (MLM) objective. To align the pre-training process with the specific methodology of Cross-Lingual Language Models (XLMs), what is the most crucial characteristic of the input data?
Core Training Principle of XLM
Translation Language Modeling
Input Embedding in Cross-Lingual Language Models
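The related cards above center on one idea: for Translation Language Modeling (TLM), the crucial input characteristic is that each training example packs a sentence together with its translation, so a masked token in one language can be predicted from context in the other. The sketch below illustrates that packing; the function name, special tokens, and masking details are illustrative assumptions, not the exact XLM implementation.

```python
import random

def build_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, seed=0):
    """Pack a parallel sentence pair into a single TLM input (sketch).

    Concatenating a sentence with its translation is what lets the
    model align representations across languages: recovering a masked
    word in one language can rely on the other language's context.
    """
    rng = random.Random(seed)
    tokens = ["<s>"] + src_tokens + ["</s>"] + tgt_tokens + ["</s>"]
    # Language embeddings: 0 for the source half, 1 for the target half.
    lang_ids = [0] * (len(src_tokens) + 2) + [1] * (len(tgt_tokens) + 1)
    # Position indices reset at the start of the translation, as in XLM.
    positions = list(range(len(src_tokens) + 2)) + list(range(len(tgt_tokens) + 1))
    inputs, labels = [], []
    for tok in tokens:
        if tok not in ("<s>", "</s>") and rng.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(tok)   # token the model must predict
        else:
            inputs.append(tok)
            labels.append(None)  # excluded from the MLM loss
    return inputs, lang_ids, positions, labels
```

Contrast this with plain multilingual MLM, where monolingual sentences from different languages are never paired in one example; that difference in input construction is the hinge of the analysis question above.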