Diagnosing Cross-Lingual Representation Issues
A developer is pre-training a single sequence-to-sequence model for translating between Spanish and Portuguese. They observe that the model struggles to find common patterns between the two languages, treating words like 'sol' (Spanish) and 'sol' (Portuguese), which both mean 'sun', as completely unrelated concepts. Explain why creating a unified vocabulary containing tokens from both languages is a critical step to resolve this issue and enable the model to learn shared representations.
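The core issue can be illustrated with a minimal sketch. The code below is a hypothetical toy example (not the developer's actual tokenizer): it contrasts two separate language-specific vocabularies, where 'sol' indexes two independent embedding tables, with one unified vocabulary, where both languages share a single embedding row for 'sol' and therefore update the same parameters during pre-training.

```python
import random

def build_vocab(tokens):
    """Assign each unique token an integer ID in order of first appearance."""
    vocab = {}
    for t in tokens:
        vocab.setdefault(t, len(vocab))
    return vocab

def init_embeddings(vocab, dim=4, seed=0):
    """Randomly initialize one embedding vector per token ID (toy stand-in
    for a model's embedding matrix)."""
    rng = random.Random(seed)
    return {tid: [rng.gauss(0, 1) for _ in range(dim)] for tid in vocab.values()}

spanish = ["el", "sol", "brilla"]      # Spanish: "the sun shines"
portuguese = ["o", "sol", "brilha"]    # Portuguese: "the sun shines"

# Separate vocabularies: each language gets its own embedding table,
# so 'sol' points at two unrelated parameter vectors.
es_vocab, pt_vocab = build_vocab(spanish), build_vocab(portuguese)
es_emb = init_embeddings(es_vocab, seed=0)
pt_emb = init_embeddings(pt_vocab, seed=1)
separate_sol = (es_emb[es_vocab["sol"]], pt_emb[pt_vocab["sol"]])

# Unified vocabulary: 'sol' appears once, so both languages read and
# update the SAME embedding vector, letting gradients from Spanish and
# Portuguese sentences reinforce one shared representation of "sun".
shared_vocab = build_vocab(spanish + portuguese)
shared_emb = init_embeddings(shared_vocab, seed=2)
shared_sol = shared_emb[shared_vocab["sol"]]
```

In the separate-vocabulary setup, nothing ties the two 'sol' vectors together; any similarity would have to be learned indirectly. With the unified vocabulary there is exactly one 'sol' entry, which is the mechanism by which shared surface forms become shared representations.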
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Related
A team is building a single encoder-decoder model intended to translate between Japanese, Korean, and Mandarin. They pre-train the model on a large, combined corpus of all three languages. However, instead of creating a unified vocabulary that includes tokens from all three languages, they use three separate, language-specific vocabularies. What is the most direct and critical consequence of this design choice on the model's translation performance?
Evaluating Pre-training Strategies for a Bilingual Model