Example
Time Machine Corpus Statistics
After applying the full text-processing pipeline (preprocessing, character-level tokenization, and vocabulary construction) to H. G. Wells' The Time Machine, the resulting corpus contains 173,428 token indices and the vocabulary comprises 28 unique tokens. These tokens correspond to the 26 lowercase English letters, the space character, and one special unknown-token symbol. The large ratio of corpus length to vocabulary size (roughly 6,200 to 1) is characteristic of character-level tokenization: the vocabulary is inherently small, but every individual character in the text contributes a separate entry to the corpus.
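The pipeline described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the D2L implementation: the helper names `preprocess` and `build_vocab` are hypothetical, the preprocessing rule (replace non-letters with a space, then lowercase) is an assumption about what the preprocessing step does, and a short sample string stands in for the full text of the book, so the printed sizes will be far smaller than the figures quoted here.

```python
import re
from collections import Counter


def preprocess(text):
    # Assumed preprocessing: collapse every run of non-letters into one
    # space, then lowercase, leaving only 'a'-'z' and ' '.
    return re.sub(r"[^A-Za-z]+", " ", text).lower()


def build_vocab(tokens, unk="<unk>"):
    # Hypothetical helper: index 0 is reserved for the unknown token,
    # followed by the unique tokens in sorted order.
    counter = Counter(tokens)
    idx_to_token = [unk] + sorted(counter)
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx


# Short stand-in for the full text (an assumption for illustration only).
sample = "The Time Machine, by H. G. Wells [1898]"

# Character-level tokenization: every character is its own token.
tokens = list(preprocess(sample))

idx_to_token, token_to_idx = build_vocab(tokens)

# The corpus is the token sequence mapped to integer indices.
corpus = [token_to_idx.get(tok, token_to_idx["<unk>"]) for tok in tokens]

# Corpus length grows with the text; vocabulary size is capped at
# 26 letters + space + unknown token.
print(len(corpus), len(idx_to_token))
```

Run on the full book instead of the sample string, the same two quantities are what the statistics above report: the corpus length keeps growing with the text, while the vocabulary size cannot exceed the number of distinct characters plus the unknown token.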
Updated 2026-05-13
Tags
D2L
Dive into Deep Learning @ D2L