Trade-offs of a Unified Vocabulary in Multilingual Models
A team is developing a new multilingual language model to support 10 different languages, including languages with different scripts (e.g., Latin, Cyrillic, Arabic). They decide to use a single, unified vocabulary for all languages. Analyze the primary advantage and the primary disadvantage of this shared vocabulary approach.
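The trade-off in the question can be made concrete with a minimal sketch. The snippet below uses a toy word-level vocabulary (an assumption for clarity; real multilingual models use subword tokenizers such as BPE or SentencePiece) built from one English and one Spanish sentence, both containing the surface form 'pie'.

```python
# Toy sketch of a unified multilingual vocabulary (assumption: word-level
# tokens; production systems use subword tokenization instead).
english = "the cat ate the pie".split()
spanish = "el gato se rompió el pie".split()

# Primary advantage: a single table serves every language, identical
# surface forms are stored once, and embeddings can be shared across
# languages, which enables cross-lingual transfer.
shared_vocab = {tok: i for i, tok in enumerate(dict.fromkeys(english + spanish))}

# The ambiguous form 'pie' (baked dish in English, 'foot' in Spanish)
# maps to one shared ID, so both senses must share one embedding row.
pie_id = shared_vocab["pie"]

# Primary disadvantage: a fixed vocabulary budget is split across all
# languages and scripts, so each language gets fewer dedicated tokens
# than a monolingual vocabulary of the same total size would give it.
separate_sizes = len(set(english)) + len(set(spanish))
print(len(shared_vocab), "shared tokens vs.", separate_sizes, "in separate vocabularies")
```

The single shared ID for 'pie' also previews the representation question in the related cards: the model must disambiguate the two senses from context rather than from the vocabulary itself.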
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A multilingual model is pre-trained on a large corpus of English and Spanish text using a single, unified vocabulary. The model processes the word 'pie', which means 'foot' in Spanish and refers to a baked dish in English. How will this word most likely be represented within the model's vocabulary structure?
Trade-offs of a Unified Vocabulary in Multilingual Models
In a multilingual model pre-trained on English and German, the shared vocabulary is structured into two distinct sections, one for English tokens and one for German tokens, to prevent interference between the languages.
Language-Independent Token Representations
Example of Code-Switching between Chinese and English