TimeMachine Build Pipeline
The TimeMachine class provides a build method that combines all text-to-sequence conversion steps into a single end-to-end pipeline. Given the raw text and an optional pre-existing vocabulary, build first preprocesses the text (replacing runs of non-alphabetic characters with a single space and lowercasing the result), then tokenizes the cleaned text into individual characters, spaces included. If no vocabulary is supplied, it constructs one from the resulting tokens using the Vocab class. Finally, it maps every token to its integer index via the vocabulary and returns two outputs: corpus, a flat list of integer token indices, and vocab, the vocabulary object. Because every step runs inside one method, preprocessing, tokenization, and numericalization are always applied in the same order, and passing an existing vocabulary lets new text be indexed consistently with a previously built corpus.
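The pipeline described above can be sketched in a few lines of standalone Python. This is a minimal illustration rather than the D2L source: SimpleVocab is a hypothetical, simplified stand-in for the Vocab class (no frequency filtering or reserved tokens beyond `<unk>`), and preprocess, tokenize, and build are written as plain functions instead of methods on TimeMachine.

```python
import re


class SimpleVocab:
    """Hypothetical stand-in for Vocab; not the D2L implementation."""

    def __init__(self, tokens):
        # Index 0 is reserved for unknown tokens; the remaining
        # unique tokens are sorted so the mapping is deterministic.
        self.idx_to_token = ['<unk>'] + sorted(set(tokens))
        self.token_to_idx = {tok: i for i, tok in enumerate(self.idx_to_token)}

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, token):
        # Tokens missing from the vocabulary map to the <unk> index.
        return self.token_to_idx.get(token, 0)


def preprocess(text):
    # Replace each run of non-alphabetic characters with one space, then lowercase.
    return re.sub('[^A-Za-z]+', ' ', text).lower()


def tokenize(text):
    # Character-level tokenization: every character, including spaces, is a token.
    return list(text)


def build(raw_text, vocab=None):
    tokens = tokenize(preprocess(raw_text))
    if vocab is None:
        # Only build a vocabulary when the caller did not supply one.
        vocab = SimpleVocab(tokens)
    corpus = [vocab[token] for token in tokens]
    return corpus, vocab


corpus, vocab = build("The Time Machine, by H. G. Wells")
print(len(corpus), len(vocab))
```

Passing the returned vocab back into build, as in `build(other_text, vocab)`, indexes new text with the same token-to-index mapping; in this sketch, any character not seen when the vocabulary was built falls back to index 0.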
Tags
D2L
Dive into Deep Learning @ D2L