Diagnosing Cross-Lingual Representation Issues
A developer is pre-training a single sequence-to-sequence model for translating between Spanish and Portuguese. They observe that the model struggles to find common patterns between the two languages, treating words like 'sol' (Spanish) and 'sol' (Portuguese), which both mean 'sun', as completely unrelated concepts. Explain why creating a unified vocabulary containing tokens from both languages is a critical step to resolve this issue and enable the model to learn shared representations.
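The core issue can be illustrated with a minimal sketch. The code below is a hypothetical toy example (not the developer's actual tokenizer): it contrasts two separate language-specific vocabularies, where 'sol' indexes two independent embedding tables, with one unified vocabulary, where both languages share a single embedding row for 'sol' and therefore update the same parameters during pre-training.

```python
import random

def build_vocab(tokens):
    """Assign each unique token an integer ID in order of first appearance."""
    vocab = {}
    for t in tokens:
        vocab.setdefault(t, len(vocab))
    return vocab

def init_embeddings(vocab, dim=4, seed=0):
    """Randomly initialize one embedding vector per token ID (toy stand-in
    for a model's embedding matrix)."""
    rng = random.Random(seed)
    return {tid: [rng.gauss(0, 1) for _ in range(dim)] for tid in vocab.values()}

spanish = ["el", "sol", "brilla"]      # Spanish: "the sun shines"
portuguese = ["o", "sol", "brilha"]    # Portuguese: "the sun shines"

# Separate vocabularies: each language gets its own embedding table,
# so 'sol' points at two unrelated parameter vectors.
es_vocab, pt_vocab = build_vocab(spanish), build_vocab(portuguese)
es_emb = init_embeddings(es_vocab, seed=0)
pt_emb = init_embeddings(pt_vocab, seed=1)
separate_sol = (es_emb[es_vocab["sol"]], pt_emb[pt_vocab["sol"]])

# Unified vocabulary: 'sol' appears once, so both languages read and
# update the SAME embedding vector, letting gradients from Spanish and
# Portuguese sentences reinforce one shared representation of "sun".
shared_vocab = build_vocab(spanish + portuguese)
shared_emb = init_embeddings(shared_vocab, seed=2)
shared_sol = shared_emb[shared_vocab["sol"]]
```

In the separate-vocabulary setup, nothing ties the two 'sol' vectors together; any similarity would have to be learned indirectly. With the unified vocabulary there is exactly one 'sol' entry, which is the mechanism by which shared surface forms become shared representations.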
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Related
A team is building a single encoder-decoder model intended to translate between Japanese, Korean, and Mandarin. They pre-train the model on a large, combined corpus of all three languages. However, instead of creating a unified vocabulary that includes tokens from all three languages, they use three separate, language-specific vocabularies. What is the most direct and critical consequence of this design choice on the model's translation performance?
Evaluating Pre-training Strategies for a Bilingual Model