Activity (Process)

Character-Level Tokenization

When preparing a text corpus for sequence modeling, a practical simplification is to tokenize the text at the character level rather than the word level. In character-level tokenization, each individual character becomes a separate token. The vocabulary is therefore very small (e.g., 28 entries for lowercased English text in which all non-letters are replaced by spaces: the 26 letters, the space, and one reserved unknown token), but the token sequences are much longer, since every word expands into several tokens. This choice is often made in early experiments because a small, fixed vocabulary avoids the complications of rare or unknown words and removes the need for more sophisticated subword segmentation methods.
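The idea can be illustrated with a minimal Python sketch. The helper names (char_tokenize, build_vocab, encode), the "<unk>" token string, and the sample sentence are assumptions chosen for illustration, not part of any particular library; the preprocessing (lowercasing and replacing every run of non-letters with a single space) is likewise one possible convention.

```python
import re


def char_tokenize(text):
    """Lowercase the text, replace runs of non-letters with one space,
    and return the result as a list of single-character tokens."""
    cleaned = re.sub(r"[^a-z]+", " ", text.lower()).strip()
    return list(cleaned)


def build_vocab(tokens, unk="<unk>"):
    """Map each distinct token to an integer index.
    Index 0 is reserved for the (hypothetical) unknown token."""
    idx_to_token = [unk] + sorted(set(tokens))
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx


def encode(tokens, token_to_idx, unk="<unk>"):
    """Convert tokens to indices, falling back to the unknown index."""
    unk_id = token_to_idx[unk]
    return [token_to_idx.get(tok, unk_id) for tok in tokens]


sample = "The Time Machine, by H. G. Wells"
tokens = char_tokenize(sample)
idx_to_token, token_to_idx = build_vocab(tokens)
ids = encode(tokens, token_to_idx)

# Compare sequence lengths: characters versus whitespace-separated words.
num_char_tokens = len(tokens)
num_word_tokens = len("".join(tokens).split())
print(len(idx_to_token), num_char_tokens, num_word_tokens)
```

With this preprocessing the vocabulary can never exceed 28 entries, however large the corpus grows, while the character sequence is several times longer than the corresponding word sequence.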

Updated 2026-05-17

Tags

D2L

Dive into Deep Learning @ D2L