Example

Example of a Packed Bilingual Sentence Sequence

To prepare bilingual data for a model, an aligned sentence pair, such as the Chinese '鲸鱼 是 哺乳 动物 。' and its English translation 'Whales are mammals .', is concatenated into a single input sequence. Special tokens are used to structure this sequence: [CLS] marks the beginning, and [SEP] separates the two sentences, with another [SEP] at the end. The final packed sequence is: [CLS] 鲸鱼 是 哺乳 动物 。 [SEP] Whales are mammals . [SEP].
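The packing step described above can be sketched in a few lines of Python. This is a minimal illustration, not a specific library's API: the helper name `pack_pair` is hypothetical, and the sentences are assumed to be already tokenized and separated by spaces, as they are written in the example.

```python
def pack_pair(src_tokens, tgt_tokens, cls="[CLS]", sep="[SEP]"):
    """Concatenate two token lists into one packed sequence.

    Hypothetical helper for illustration: [CLS] opens the sequence,
    one [SEP] follows the source sentence and one follows the target.
    """
    return [cls, *src_tokens, sep, *tgt_tokens, sep]


# Assumption: tokens are whitespace-separated, as in the example text.
zh = "鲸鱼 是 哺乳 动物 。".split()
en = "Whales are mammals .".split()

packed = pack_pair(zh, en)
print(" ".join(packed))
# prints: [CLS] 鲸鱼 是 哺乳 动物 。 [SEP] Whales are mammals . [SEP]
```

In a real pipeline, a tokenizer would map these tokens to vocabulary IDs; the sketch covers only the concatenation with special tokens.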

Updated 2026-05-02

Tags

Ch.1 Pre-training - Foundations of Large Language Models

Foundations of Large Language Models

Computing Sciences

Foundations of Large Language Models Course