Multi-lingual BERT (mBERT)
Multi-lingual BERT (mBERT) is a version of BERT trained on text from 104 different languages. Its main distinction from monolingual BERT is its significantly larger shared vocabulary (roughly 120K WordPiece tokens, versus about 30K for the original English BERT), which accommodates tokens from this diverse set of languages. Training on all languages with a single shared vocabulary allows mBERT to map representations from different languages into a common vector space, which enables the model to share and transfer knowledge across languages.
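The shared vector space can be observed directly by comparing sentence representations across languages. Below is a minimal sketch using the Hugging Face transformers library and the publicly released bert-base-multilingual-cased checkpoint; the example sentences and the mean-pooling choice are illustrative assumptions, not part of the original mBERT recipe.

```python
import torch
from transformers import BertTokenizer, BertModel

# Load the multilingual checkpoint (104 languages, shared WordPiece vocabulary).
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

# Illustrative sentences with the same meaning in three languages (assumed examples).
sentences = {
    "en": "The weather is nice today.",
    "de": "Das Wetter ist heute schön.",
    "ja": "今日は天気がいいです。",
}

embeddings = {}
with torch.no_grad():
    for lang, text in sentences.items():
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)
        # Mean-pool the final hidden states into a single sentence vector.
        embeddings[lang] = outputs.last_hidden_state.mean(dim=1).squeeze(0)

# Translations of the same sentence tend to land near each other in
# mBERT's common vector space, regardless of language.
for lang in ("de", "ja"):
    sim = torch.nn.functional.cosine_similarity(
        embeddings["en"], embeddings[lang], dim=0
    )
    print(f"en vs {lang}: cosine similarity = {sim.item():.3f}")
```

This cross-lingual alignment is what makes zero-shot transfer possible: a classifier fine-tuned on top of mBERT using labeled data in one language can often be applied to other languages with no additional labeled examples.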