Example
Time Machine Corpus Statistics
After applying the full text-processing pipeline (preprocessing, character-level tokenization, and vocabulary construction) to H. G. Wells' The Time Machine, the resulting corpus is a single long sequence of token indices, one per character, while the vocabulary comprises only 28 unique tokens: the 26 lowercase English letters, the space character, and one special unknown-token symbol. This large ratio of corpus length to vocabulary size is characteristic of character-level tokenization. The vocabulary is inherently small, but every individual character in the text contributes a separate entry to the corpus.
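The relationship between corpus length and vocabulary size can be illustrated with a minimal sketch. It assumes a simple pipeline (lowercase the text, collapse every run of non-letters into one space, split into characters) and runs on a single sample sentence rather than the full book. The names `preprocess`, `build_vocab`, and `sample` are hypothetical helpers written for this illustration, not the D2L library API, and the index ordering chosen here is an assumption.

```python
import re
from collections import Counter


def preprocess(text):
    """Lowercase the text and replace each run of non-letters with one space."""
    return re.sub(r"[^a-z]+", " ", text.lower())


def build_vocab(tokens, unk="<unk>"):
    """Map each distinct token to an index, reserving index 0 for `unk`.

    Ordering (most frequent first, ties alphabetical) is an assumption
    made for this sketch; any fixed ordering would work.
    """
    counts = Counter(tokens)
    idx_to_token = [unk] + sorted(counts, key=lambda t: (-counts[t], t))
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx


# Illustrative sample only; the statistics in the text refer to the whole book.
sample = (
    "The Time Traveller (for so it will be convenient to speak of him) "
    "was expounding a recondite matter to us."
)

tokens = list(preprocess(sample))          # character-level tokenization
idx_to_token, token_to_idx = build_vocab(tokens)
corpus = [token_to_idx.get(t, 0) for t in tokens]  # unknown chars map to index 0

# The corpus has one index per character; the vocabulary can never exceed
# 26 letters + space + the unknown token = 28 entries under this preprocessing.
print(len(corpus), len(idx_to_token))
```

With this preprocessing, the vocabulary is capped at 28 entries regardless of how long the input is, whereas the corpus grows by one index for every character, which is why the ratio becomes so large on a full-length book.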
Updated 2026-05-13
Tags
D2L
Dive into Deep Learning @ D2L