Learn Before
Shared Vocabulary in Multilingual Models
In multilingual pre-trained models, tokens from different languages are not explicitly tagged with their source language. Instead, all of them are treated as entries in a single, unified vocabulary. This effectively creates a composite 'language' whose vocabulary spans every language in the training data, allowing the model to handle multilingual text seamlessly.
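A minimal sketch of this idea is shown below, assuming the Hugging Face transformers library and the public xlm-roberta-base checkpoint (a multilingual model whose tokenizer was trained jointly across languages); the checkpoint name and the approximate vocabulary size are assumptions of this example, not part of the original card.

# A minimal sketch of a shared multilingual vocabulary. Assumes the
# Hugging Face `transformers` library is installed and the public
# `xlm-roberta-base` checkpoint is available.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# English and Spanish sentences are segmented with the SAME tokenizer;
# the resulting IDs index one unified vocabulary, and nothing in an ID
# records which language a token came from.
english_ids = tokenizer("The model reads text.")["input_ids"]
spanish_ids = tokenizer("El modelo lee texto.")["input_ids"]

print(english_ids)
print(spanish_ids)

# Both ID lists are drawn from a single ID space covering every
# language seen in pre-training (roughly 250k entries for XLM-R).
print(tokenizer.vocab_size)

One consequence of this design: a surface form spelled identically in two languages maps to the same vocabulary entry, which is exactly the situation probed by the 'pie' question in the Learn After list below.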
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Cross-Lingual Learning
Bilingual Pre-training for Multilingual Models
Benefit of Multilingual Pre-trained Models: Handling Code-Switching
Factors Influencing Multilingual Pre-training
A company is developing a sentiment analysis tool. Its primary market is France, for which it has a massive, high-quality French dataset. It also needs to provide functional support for Spanish and German, but has very limited data for those languages. The top priority is achieving state-of-the-art performance for the French market while still handling the other languages. Given these requirements, which strategy for choosing a foundation model is most appropriate?
Model Selection for a Monolingual Task
Match each pre-trained model with the description that best characterizes its training methodology and primary use case.
Learn After
A multilingual model is pre-trained on a large corpus of English and Spanish text using a single, unified vocabulary. The model processes the word 'pie', which means 'foot' in Spanish and refers to a baked dish in English. How will this word most likely be represented within the model's vocabulary structure?
Trade-offs of a Unified Vocabulary in Multilingual Models
In a multilingual model pre-trained on English and German, the shared vocabulary is structured into two distinct sections, one for English tokens and one for German tokens, to prevent interference between the languages.
Language-Independent Token Representations
Example of Code-Switching between Chinese and English