After applying the full text-processing pipeline (preprocessing, character-level tokenization, and vocabulary construction) to H. G. Wells' The Time Machine, the resulting corpus contains $$173{,}428$$ token indices and the vocabulary comprises $$28$$ unique tokens. These $$28$$ tokens correspond to the $$26$$ lowercase English letters, the space character, and one special unknown-token symbol. The large ratio of corpus length to vocabulary size is characteristic of character-level tokenization, where the vocabulary is inherently small but every individual character in the text contributes a separate entry to the corpus.
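The vocabulary size and the corpus-to-vocabulary ratio can be checked with a few lines of arithmetic. This is only an illustrative sketch: the `<unk>` spelling of the unknown-token symbol is an assumption, and the corpus length is the figure quoted above rather than something recomputed from the text.

```python
import string

# 26 lowercase letters + the space character + one unknown-token symbol.
# The "<unk>" spelling is an assumption made for illustration.
vocab_tokens = ["<unk>", " "] + list(string.ascii_lowercase)
vocab_size = len(vocab_tokens)

# Corpus length as quoted in the text above (not recomputed here).
corpus_len = 173_428

# Average number of corpus positions per vocabulary entry.
ratio = corpus_len / vocab_size
print(vocab_size, round(ratio))
```

On average, each vocabulary entry therefore accounts for several thousand positions in the corpus, which is what one would expect when every character is its own token.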

Claude

The TimeMachine class provides a build method that combines all text-to-sequence conversion steps into a single end-to-end pipeline. Given the raw text and an optional pre-existing vocabulary, build first preprocesses the text (lowercasing it and replacing each run of non-alphabetic characters with a single space), then tokenizes the cleaned text into individual characters. If no vocabulary is supplied, it constructs one from the resulting tokens using the Vocab class; if one is supplied, that vocabulary is reused unchanged. Finally, it maps every token to its integer index via the vocabulary and returns two outputs: corpus, a flat list of integer token indices, and vocab, the vocabulary object. Performing all steps in one method ensures that preprocessing, tokenization, and numericalization always run in the same order and with the same settings.
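The steps described above can be sketched as follows. This is a minimal, self-contained approximation rather than the original implementation: the `Vocab` class here is a simplified stand-in, the free functions `preprocess`, `tokenize`, and `build` are hypothetical names chosen for illustration, and a short sample string replaces the full book text.

```python
import collections
import re


class Vocab:
    """Simplified stand-in for a vocabulary class (an assumption, not the original)."""

    def __init__(self, tokens, reserved=("<unk>",)):
        counter = collections.Counter(tokens)
        # Reserved symbols plus every observed token, sorted for reproducibility.
        self.idx_to_token = sorted(set(reserved) | set(counter))
        self.token_to_idx = {tok: i for i, tok in enumerate(self.idx_to_token)}
        self.unk = self.token_to_idx["<unk>"]

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        # Accept a single token or a list/tuple of tokens.
        if isinstance(tokens, (list, tuple)):
            return [self[tok] for tok in tokens]
        # Unknown tokens fall back to the <unk> index.
        return self.token_to_idx.get(tokens, self.unk)


def preprocess(text):
    """Replace runs of non-alphabetic characters with a space, then lowercase."""
    return re.sub("[^A-Za-z]+", " ", text).lower()


def tokenize(text):
    """Character-level tokenization: one token per character."""
    return list(text)


def build(raw_text, vocab=None):
    """Preprocess, tokenize, and numericalize; build a vocabulary if none is given."""
    tokens = tokenize(preprocess(raw_text))
    if vocab is None:
        vocab = Vocab(tokens)
    corpus = [vocab[tok] for tok in tokens]  # flat list of integer indices
    return corpus, vocab


sample = "The Time Machine, by H. G. Wells [1898]"
corpus, vocab = build(sample)
print(len(corpus), len(vocab))
```

Passing the returned `vocab` back into `build` for a second text reuses the same token-to-index mapping, so characters not seen in the first text map to the unknown-token index instead of receiving new indices.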

TimeMachine Build Pipeline

Dive into Deep Learning

In some text datasets, the corpus of token indices is stored as a single flat list rather than a nested list of per-line or per-sentence token lists. This design is appropriate when the source text lacks meaningful sentence or paragraph boundaries, for example when a line of the original file is not guaranteed to be a complete sentence. Flattening all tokens into one continuous sequence treats the entire text as a single stream of characters (or words), which simplifies downstream operations such as generating fixed-length training subsequences for language models.
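A small sketch of why the flat layout is convenient: once the nested per-line lists are concatenated, fixed-length subsequences can be taken with plain slicing, without handling line boundaries. The helper names `flatten` and `sliding_windows` and the toy indices are hypothetical, chosen only for illustration.

```python
from itertools import chain


def flatten(nested):
    """Concatenate per-line token-index lists into one flat corpus."""
    return list(chain.from_iterable(nested))


def sliding_windows(corpus, num_steps):
    """Return every contiguous subsequence of length num_steps."""
    return [corpus[i:i + num_steps] for i in range(len(corpus) - num_steps + 1)]


# Toy per-line token indices; lines have unequal lengths.
per_line = [[1, 2], [3], [4, 5]]

corpus = flatten(per_line)
windows = sliding_windows(corpus, num_steps=3)
print(corpus)
print(windows)
```

With the nested layout, the same windows would have to be stitched together across line boundaries, since no single line here is long enough to yield a subsequence of length 3.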
