Concept

Language-Independent Token Representations

In multi-lingual pre-training, particularly in models that use a shared vocabulary across languages, it is generally unnecessary to specify the source language of each token. Some models instead add explicit language embeddings to distinguish languages, but this makes it harder to handle code-switching, where multiple languages are mixed within the same text. Token representations in shared-vocabulary multi-lingual models are therefore typically treated as language-independent.
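As a concrete illustration (not taken from the source), below is a minimal PyTorch-style sketch of an input embedding layer over a shared multi-lingual vocabulary that omits language embeddings entirely, so the same token id maps to the same vector regardless of which language it came from. The class name, vocabulary size, and dimensions are hypothetical choices for illustration only.

```python
import torch
import torch.nn as nn

class SharedVocabEmbedding(nn.Module):
    """Token + position embeddings over a shared multi-lingual vocabulary.

    No language embedding is added, so token representations are
    language-independent: a code-switched sequence needs no per-token
    language id. Sizes below are illustrative assumptions.
    """

    def __init__(self, vocab_size: int, d_model: int, max_len: int = 512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) ids drawn from one shared sub-word vocabulary
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.token_emb(token_ids) + self.pos_emb(positions)


if __name__ == "__main__":
    emb = SharedVocabEmbedding(vocab_size=250_000, d_model=768)
    # Hypothetical code-switched sequence: ids from the shared vocabulary,
    # with no indication of which language each token belongs to.
    mixed_ids = torch.randint(0, 250_000, (1, 8))
    print(emb(mixed_ids).shape)  # torch.Size([1, 8, 768])
```

A model that did use explicit language embeddings would add a third lookup table indexed by a per-token language id, which is exactly the information that is ambiguous for code-switched text.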

Updated 2026-04-18

Tags

Foundations of Large Language Models

Ch.1 Pre-training - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences