Language-Independent Token Representations
In multilingual pre-training, particularly for models that use a shared vocabulary, it is generally unnecessary to specify the source language of each token. Some models add explicit language embeddings to distinguish languages, but this makes it difficult to handle code-switching, where multiple languages are mixed within the same text: a code-switched sentence has no single, well-defined language label to assign to each token. Token representations in such multilingual models are therefore typically assumed to be language-independent.
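A minimal sketch of the contrast, assuming PyTorch and hypothetical vocabulary sizes, token IDs, and function names (none of these come from the note itself). The first design looks up tokens in a shared vocabulary with no language signal; the second adds an explicit language embedding, which requires a per-token language ID that is ambiguous for code-switched input.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
VOCAB_SIZE = 32000   # shared subword vocabulary covering all languages
NUM_LANGS = 2        # e.g., English and Spanish
D_MODEL = 512

# Design 1: language-independent lookup. A token's representation depends
# only on its ID in the shared vocabulary, so Spanish "pie" and English
# "pie" map to the same embedding; surrounding context must disambiguate.
shared_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)

def embed_shared(token_ids: torch.Tensor) -> torch.Tensor:
    return shared_embed(token_ids)

# Design 2: explicit language embeddings. Every token also needs a language
# ID, which is well-defined for monolingual text but has no clear value for
# a sentence that mixes the two languages.
lang_embed = nn.Embedding(NUM_LANGS, D_MODEL)

def embed_with_lang(token_ids: torch.Tensor,
                    lang_ids: torch.Tensor) -> torch.Tensor:
    return shared_embed(token_ids) + lang_embed(lang_ids)

tokens = torch.tensor([[101, 2057, 7042, 102]])  # hypothetical token IDs
langs = torch.tensor([[0, 0, 1, 0]])             # which label for mixed text?

print(embed_shared(tokens).shape)            # torch.Size([1, 4, 512])
print(embed_with_lang(tokens, langs).shape)  # torch.Size([1, 4, 512])
```

Under Design 1, the language-ID question never arises, which is one reason shared-vocabulary models handle code-switching more gracefully.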
Tags
Foundations of Large Language Models
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
A multilingual model is pre-trained on a large corpus of English and Spanish text using a single, unified vocabulary. The model processes the word 'pie', which means 'foot' in Spanish and refers to a baked dish in English. How will this word most likely be represented within the model's vocabulary structure?
Trade-offs of a Unified Vocabulary in Multilingual Models
In a multilingual model pre-trained on English and German, the shared vocabulary is structured into two distinct sections, one for English tokens and one for German tokens, to prevent interference between the languages.
Language-Independent Token Representations
Example of Code-Switching between Chinese and English
Models of Code Switching
Why Speakers Code-Switch
Benefit of Multilingual Pre-trained Models: Handling Code-Switching
A user is sending text messages that mix two different languages. Which of the following messages best exemplifies the practice of alternating between languages within a single, coherent thought or sentence?
Diagnosing NLP Model Failure
Defining and Illustrating Code-Switching