
Multilingual and Language-Specific PTMs

Learning text representations that are shared across languages plays an important role in many cross-lingual NLP tasks. Two representative examples:

  • Multilingual BERT (mBERT): It is pre-trained with MLM, using a vocabulary and weights shared across Wikipedia text from the top 104 languages. Each training sample is a monolingual document; no cross-lingual objective is designed and no cross-lingual data are used. Even so, mBERT generalizes across languages surprisingly well.

  • Cross-Lingual Language Model (XLM): XLM improves on mBERT by adding a cross-lingual objective, translation language modeling (TLM), which performs MLM on the concatenation of a parallel bilingual sentence pair, so a masked word in one language can be predicted from its translation. The sketch after this list contrasts how MLM and TLM training examples are built.
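To make the two objectives concrete, here is a minimal Python sketch of how a single training example is constructed in each case. It assumes a toy whitespace tokenizer, the conventional 15% masking rate, and a hypothetical English-French sentence pair; these details are illustrative assumptions rather than specifics from the text, and real implementations operate on subword vocabularies shared across languages.

    import random

    CLS, SEP, MASK = "[CLS]", "[SEP]", "[MASK]"
    MASK_PROB = 0.15  # conventional BERT masking rate; assumed for illustration

    def mask_tokens(tokens, mask_prob=MASK_PROB):
        """Randomly replace tokens with [MASK]; return masked tokens and MLM labels."""
        masked, labels = [], []
        for tok in tokens:
            if tok not in (CLS, SEP) and random.random() < mask_prob:
                masked.append(MASK)
                labels.append(tok)    # the model must recover the original token
            else:
                masked.append(tok)
                labels.append(None)   # position is not predicted
        return masked, labels

    def mlm_example(sentence):
        """mBERT-style sample: one monolingual sentence, masked."""
        tokens = [CLS] + sentence.split() + [SEP]
        return mask_tokens(tokens)

    def tlm_example(src_sentence, tgt_sentence):
        """XLM-style TLM sample: a parallel sentence pair is concatenated before masking,
        so a masked word in one language can be predicted from its translation.
        (XLM additionally resets position indices for the second sentence and adds
        language embeddings, which are omitted here.)"""
        tokens = [CLS] + src_sentence.split() + [SEP] + tgt_sentence.split() + [SEP]
        return mask_tokens(tokens)

    # Illustrative usage with a hypothetical English-French pair
    print(mlm_example("the cat sat on the mat"))
    print(tlm_example("the cat sat on the mat", "le chat est assis sur le tapis"))

The only difference between the two objectives lies in how the input sequence is assembled; the masking and prediction step is the same, which is why TLM can reuse the MLM machinery while still injecting a cross-lingual signal through the parallel data.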

Although multilingual PTMs perform well on many languages, recent work has shown that PTMs trained on a single language can significantly outperform their multilingual counterparts. For Chinese, which has no explicit word boundaries, modeling coarser-grained and multi-granularity word representations has shown great success. Monolingual PTMs have been released for various languages, such as CamemBERT and FlauBERT for French, FinBERT for Finnish, BERTje and RobBERT for Dutch, and AraBERT for Arabic.
