Concept

Preprocessing Machine Translation Datasets

After downloading the raw text of a machine translation dataset, several preprocessing steps are necessary to standardize the format and reduce noise before tokenization. Common preprocessing techniques include replacing non-breaking spaces with standard spaces, converting all uppercase letters to lowercase, and inserting spaces between words and punctuation marks so that punctuation symbols are subsequently treated as separate tokens.
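The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `preprocess_text`, the punctuation set `,.!?`, and the choice of U+00A0 and U+202F as the non-breaking spaces to replace are all assumptions made for this example, and a real dataset may need a different set.

```python
# Hypothetical punctuation set; extend it for the languages in your dataset.
PUNCTUATION = set(",.!?")


def preprocess_text(text: str) -> str:
    """Normalize raw parallel text before tokenization (illustrative sketch)."""
    # Step 1: replace non-breaking spaces (assumed here: U+202F and U+00A0)
    # with a standard space.
    text = text.replace("\u202f", " ").replace("\xa0", " ")

    # Step 2: convert all uppercase letters to lowercase.
    text = text.lower()

    # Step 3: insert a space before a punctuation mark unless it is at the
    # start of the text or already preceded by a space, so that splitting
    # on whitespace later yields the mark as its own token.
    out = []
    for i, char in enumerate(text):
        if i > 0 and char in PUNCTUATION and text[i - 1] != " ":
            out.append(" ")
        out.append(char)
    return "".join(out)


if __name__ == "__main__":
    raw = "Go.\tVa\u202f!"
    print(repr(preprocess_text(raw)))
```

After this pass, splitting each sentence on spaces gives word and punctuation tokens separately; for example, `"Hello, World!"` becomes `"hello , world !"`.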

Updated 2026-05-14

Tags

D2L

Dive into Deep Learning @ D2L