In some text datasets, the corpus of token indices is stored as a single flat list rather than as a nested list of per-line or per-sentence token lists. This design is appropriate when the source text has no reliable sentence or paragraph boundaries, for example when a line of the original file is not guaranteed to hold a complete sentence. Flattening all tokens into one continuous sequence treats the entire text as a single stream of tokens (characters or words), which simplifies downstream operations such as generating fixed-length training subsequences for language models.
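A minimal sketch of the idea, using made-up toy indices rather than the real dataset. The names `nested_corpus`, `flat_corpus`, and `fixed_length_windows` are hypothetical and chosen only for illustration; the sliding-window scheme is one simple way to cut subsequences, not necessarily the one a given library uses.

```python
# Hypothetical toy data: token indices grouped per line (nested form).
nested_corpus = [[3, 1, 4], [1, 5], [9, 2, 6, 5]]

# Flatten into one continuous stream of token indices.
flat_corpus = [idx for line in nested_corpus for idx in line]


def fixed_length_windows(corpus, num_steps):
    """Return every contiguous subsequence of length num_steps."""
    return [corpus[i:i + num_steps]
            for i in range(len(corpus) - num_steps + 1)]


# With a flat corpus, windows can cross the original line boundaries.
windows = fixed_length_windows(flat_corpus, num_steps=4)
for window in windows:
    print(window)
```

Because the corpus is flat, a window such as the first one spans tokens from two different source lines, which would require extra bookkeeping in the nested form.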

Flat Corpus Representation

After applying the full text-processing pipeline (preprocessing, character-level tokenization, and vocabulary construction) to H. G. Wells' The Time Machine, the resulting corpus contains 173,428 token indices and the vocabulary comprises 28 unique tokens. These 28 tokens correspond to the 26 lowercase English letters, the space character, and one special unknown-token symbol. The large ratio of corpus length to vocabulary size (roughly 6,200 corpus entries per vocabulary entry) is characteristic of character-level tokenization, where the vocabulary is inherently small but every individual character in the text contributes a separate entry to the corpus.

Time Machine Corpus Statistics

The TimeMachine class provides a build method that combines all text-to-sequence conversion steps into a single end-to-end pipeline. Given the raw text and an optional pre-existing vocabulary, build first preprocesses the text (lowercasing it and replacing non-alphabetical characters with spaces), then tokenizes the cleaned text into individual characters. If no vocabulary is supplied, it constructs one from the resulting tokens using the Vocab class. Finally, it maps every token to its corresponding integer index via the vocabulary and returns two outputs: corpus, a flat list of integer token indices, and vocab, the vocabulary object. This unified method ensures that preprocessing, tokenization, and numericalization are always performed in the same order, so results are reproducible.
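The steps described above can be sketched in plain Python. This is an illustrative stand-in, not the actual TimeMachine or Vocab implementation: `SimpleVocab` and `build_corpus` are hypothetical names, and the vocabulary here simply sorts the unique tokens, which may differ from how the real Vocab class orders its indices.

```python
import re


class SimpleVocab:
    """Hypothetical minimal vocabulary with an '<unk>' fallback token."""

    def __init__(self, tokens):
        # Sort the unique tokens (plus '<unk>') so indices are deterministic.
        self.idx_to_token = sorted(set(tokens) | {'<unk>'})
        self.token_to_idx = {tok: i for i, tok in enumerate(self.idx_to_token)}

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, token):
        # Tokens not seen at construction time map to the '<unk>' index.
        return self.token_to_idx.get(token, self.token_to_idx['<unk>'])


def build_corpus(raw_text, vocab=None):
    """Preprocess, tokenize into characters, and map tokens to indices."""
    cleaned = re.sub('[^A-Za-z]+', ' ', raw_text).lower()  # preprocessing
    tokens = list(cleaned)                                 # character tokens
    if vocab is None:                                      # build if not given
        vocab = SimpleVocab(tokens)
    corpus = [vocab[tok] for tok in tokens]                # flat index list
    return corpus, vocab


corpus, vocab = build_corpus('The Time Machine, by H. G. Wells [1898]')
print(len(corpus), len(vocab))
```

Passing an existing vocabulary as the second argument reuses its index mapping, which is what keeps, for example, a validation text consistent with the training text; characters the vocabulary has never seen fall back to the unknown-token index.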

Claude

The initial step in preparing text for sequence modeling is to read the raw text from a dataset into memory as a continuous string. For example, a complete text file, such as a book, can be downloaded and loaded as a single sequence of characters before any further processing occurs.
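As a rough sketch of this step, the snippet below reads a whole file into one string. The helper `read_raw_text` and the file name `sample_book.txt` are hypothetical; a small temporary file stands in for a downloaded book so the example does not depend on any particular download location.

```python
import tempfile
from pathlib import Path


def read_raw_text(path):
    """Read an entire text file into memory as one string."""
    return Path(path).read_text(encoding='utf-8')


# A temporary file stands in for a downloaded book (assumption for the demo).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / 'sample_book.txt'
    sample_path.write_text('The Time Machine, by H. G. Wells [1898]\n',
                           encoding='utf-8')
    raw_text = read_raw_text(sample_path)

print(raw_text[:16])
```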

Reading Raw Text for Sequence Data

Dive into Deep Learning

The Time Machine dataset consists of H. G. Wells' book *The Time Machine*, which contains just over 30,000 words. It is used as a small, introductory text dataset to demonstrate the fundamental steps of reading and preprocessing raw text for sequence data models before scaling up to the significantly larger datasets typical of real-world applications.

The Time Machine Dataset

After raw text is loaded into memory, it is commonly preprocessed to standardize its format and reduce vocabulary complexity before it is fed into sequence models. A straightforward preprocessing strategy uses a regular expression to replace every run of non-alphabetical characters (including punctuation, digits, and whitespace) with a single space, and then converts the result to lowercase. For instance, applying the pattern [^A-Za-z]+ followed by lowercasing transforms 'The Time Machine, by H. G. Wells [1898]' into 'the time machine by h g wells '. Note the trailing space, which comes from the final ' [1898]' being replaced. This produces a uniform stream of space-separated lowercase words, eliminating punctuation and capitalization that are unnecessary for many language modeling tasks.
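The transformation can be reproduced with Python's standard `re` module. The function name `preprocess` is a hypothetical choice for this sketch; the pattern and the example string are the ones given in the text.

```python
import re


def preprocess(text):
    """Replace each run of non-letters with one space, then lowercase."""
    return re.sub('[^A-Za-z]+', ' ', text).lower()


cleaned = preprocess('The Time Machine, by H. G. Wells [1898]')
print(repr(cleaned))  # 'the time machine by h g wells '
```

The `+` in the pattern matters: without it, each non-alphabetical character would be replaced individually, so a sequence such as `, ` would become two spaces instead of one.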
