After downloading the raw text of a machine translation dataset, several preprocessing steps are necessary to standardize the format and reduce noise before tokenization. Common preprocessing techniques include replacing non-breaking spaces with standard spaces, converting all uppercase letters to lowercase, and inserting spaces between words and punctuation marks so that punctuation symbols are subsequently treated as separate tokens.

Claude

The process of text normalization includes:
- Tokenizing (segmenting) words 
- Normalizing word formats (case folding/lemmatization/stemming)
- Segmenting sentences

Process of text normalization

Dive into Deep Learning

Tokenization is the process of converting a sequence of text into smaller units, known as tokens. It is a foundational step in Natural Language Processing, and there are numerous different methods and strategies for how a text can be tokenized.

Tokenization

Punctuation, like periods, question marks, and exclamation points, are the most useful sentence-boundary markers. However, punctuation like periods are ambiguous among a sentence boundary marker, a number marker, and a marker of abbreviations. For this reason, sentence tokenization and word tokenization may be addressed jointly.

Sentence segmentation

Word normalization is the task of unifying words/tokens to a single standard format. The methods below are widely used depending on the tasks.

- Case folding
- Lemmatization
- Stemming

Word normalization

Unix commands such as tr, sort, and uniq can be used for simple normalization, tokenization, and frequency computation.

Learn Before

Related