Example

Time Machine Corpus Statistics

After applying the full text-processing pipeline (preprocessing, character-level tokenization, and vocabulary construction) to H. G. Wells' The Time Machine, the resulting corpus contains 173,428 token indices and the vocabulary comprises 28 unique tokens. These 28 tokens correspond to the 26 lowercase English letters, the space character, and one special unknown-token symbol. The ratio of corpus length to vocabulary size, roughly 6,200 to 1, is characteristic of character-level tokenization: the vocabulary is inherently small, but every individual character in the text contributes a separate entry to the corpus.
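The three pipeline stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the book's own implementation: the helper names `preprocess`, `build_vocab`, and `text_to_corpus` are hypothetical, the `<unk>` token is assumed to sit at index 0, and a short pangram stands in for the full text of the novel.

```python
import re
from collections import Counter


def preprocess(raw_text):
    # Replace every run of non-letter characters with a single space,
    # then lowercase, so only 'a'-'z' and ' ' remain.
    return re.sub(r"[^A-Za-z]+", " ", raw_text).lower()


def build_vocab(tokens, unk="<unk>"):
    # Assumption: index 0 is reserved for the unknown token; the remaining
    # tokens are ordered by descending frequency (ties broken alphabetically).
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    idx_to_token = [unk] + [tok for tok, _ in ordered]
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx


def text_to_corpus(raw_text):
    # Character-level tokenization: every character becomes one token.
    tokens = list(preprocess(raw_text))
    idx_to_token, token_to_idx = build_vocab(tokens)
    # Map each token to its index; unseen tokens fall back to index 0.
    corpus = [token_to_idx.get(tok, 0) for tok in tokens]
    return corpus, idx_to_token


# Stand-in text (not from the novel) that happens to use all 26 letters.
sample = "The quick brown fox jumps over the lazy dog!"
corpus, vocab = text_to_corpus(sample)
print(len(corpus), len(vocab))
```

Because the corpus grows by one entry per character while the vocabulary is capped at 26 letters, the space, and the unknown token, running the same steps on the full novel would reproduce the pattern described above: a very long corpus over a very small vocabulary.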

Updated 2026-05-13

Tags

D2L

Dive into Deep Learning @ D2L