Simplified Raw Text Preprocessing
After loading raw text into memory, it is commonly preprocessed to standardize the format and reduce vocabulary complexity before feeding it into sequence models. A straightforward preprocessing strategy uses a regular expression to replace every non-alphabetical character (including punctuation, digits, and whitespace) with a single space, and then converts all remaining characters to lowercase. For instance, applying the pattern [^A-Za-z]+ followed by lowercasing transforms 'The Time Machine, by H. G. Wells [1898]' into 'the time machine by h g wells '. This produces a uniform stream of space-separated lowercase words, eliminating noise from punctuation and capitalization that is unnecessary for many language modeling tasks.
0
1
Tags
D2L
Dive into Deep Learning @ D2L