Learn Before
Shared Vocabulary in Multilingual Models
In multilingual pre-trained models, tokens from different languages are not explicitly tagged with their source language. Instead, all of them are treated as entries in a single, unified vocabulary. This effectively creates a composite 'language' whose vocabulary spans every language in the training data, allowing the model to handle multilingual text seamlessly.
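A minimal sketch of this idea is shown below, assuming the Hugging Face transformers library and the public xlm-roberta-base checkpoint (a multilingual model whose tokenizer was trained jointly across languages); the checkpoint name and the approximate vocabulary size are assumptions of this example, not part of the original card.

# A minimal sketch of a shared multilingual vocabulary. Assumes the
# Hugging Face `transformers` library is installed and the public
# `xlm-roberta-base` checkpoint is available.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# English and Spanish sentences are segmented with the SAME tokenizer;
# the resulting IDs index one unified vocabulary, and nothing in an ID
# records which language a token came from.
english_ids = tokenizer("The model reads text.")["input_ids"]
spanish_ids = tokenizer("El modelo lee texto.")["input_ids"]

print(english_ids)
print(spanish_ids)

# Both ID lists are drawn from a single ID space covering every
# language seen in pre-training (roughly 250k entries for XLM-R).
print(tokenizer.vocab_size)

One consequence of this design: a surface form spelled identically in two languages maps to the same vocabulary entry, which is exactly the situation probed by the 'pie' question in the Learn After list below.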
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Cross-Lingual Learning
Bilingual Pre-training for Multilingual Models
Benefit of Multilingual Pre-trained Models: Handling Code-Switching
Factors Influencing Multilingual Pre-training
A company is developing a sentiment analysis tool. Its primary market is France, for which it has a massive, high-quality French dataset. It also needs to provide functional support for Spanish and German, but has very limited data for those languages. The top priority is achieving state-of-the-art performance for the French market while still handling the other languages. Given these requirements, which strategy for choosing a foundation model is most appropriate?
Model Selection for a Monolingual Task
Match each pre-trained model with the description that best characterizes its training methodology and primary use case.
Learn After
A multilingual model is pre-trained on a large corpus of English and Spanish text using a single, unified vocabulary. The model processes the word 'pie', which means 'foot' in Spanish and refers to a baked dish in English. How will this word most likely be represented within the model's vocabulary structure?
Trade-offs of a Unified Vocabulary in Multilingual Models
In a multilingual model pre-trained on English and German, the shared vocabulary is structured into two distinct sections, one for English tokens and one for German tokens, to prevent interference between the languages.
Language-Independent Token Representations
Example of Code-Switching between Chinese and English