Concept
Preprocessing Machine Translation Datasets
After downloading the raw text of a machine translation dataset, several preprocessing steps are necessary to standardize the format and reduce noise before tokenization. Common preprocessing techniques include replacing non-breaking spaces with standard spaces, converting all uppercase letters to lowercase, and inserting spaces between words and punctuation marks so that punctuation symbols are subsequently treated as separate tokens.
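The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `preprocess_nmt`, the choice of non-breaking space characters (U+00A0 and U+202F), and the punctuation set `,.!?` are assumptions made for the example, and a real dataset may need a different set.

```python
def preprocess_nmt(text):
    """Normalize raw parallel text before tokenization (illustrative sketch)."""
    # Step 1 and 2: replace non-breaking spaces with regular spaces, then lowercase.
    # Assumption: U+202F (narrow no-break space) and U+00A0 (no-break space)
    # are the only non-breaking variants present in the raw text.
    text = text.replace('\u202f', ' ').replace('\xa0', ' ').lower()

    # Step 3: insert a space before a punctuation mark that directly follows
    # a non-space character, so the mark later becomes its own token.
    # Assumption: only these four marks need separating.
    punctuation = set(',.!?')
    out = []
    for i, char in enumerate(text):
        if i > 0 and char in punctuation and text[i - 1] != ' ':
            out.append(' ' + char)
        else:
            out.append(char)
    return ''.join(out)


if __name__ == '__main__':
    print(preprocess_nmt('Hello, World!'))  # prints: hello , world !
```

After this step, splitting each line on whitespace yields word and punctuation tokens separately, which is what the later tokenization stage relies on.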
Updated 2026-05-14
Tags
D2L
Dive into Deep Learning @ D2L