When preparing a text corpus for sequence modeling, a practical simplification is to tokenize the text at the character level rather than the word level. In character-level tokenization, each individual character becomes a separate token, yielding a very small vocabulary (e.g., twenty-eight unique characters for a lowercased English text with spaces) at the cost of producing much longer token sequences. This choice is often made to simplify training in early experiments because a small, fixed vocabulary eliminates complications from rare or unknown words and avoids the need for more sophisticated sub-word segmentation methods.

Claude

Google

- Word tokenization: Penn Treebank tokenization; NLTK
- Character tokenization
- Subword tokenization: byte-pair encoding(BPE); wordpiece algorithm with MaxMatch decoding; SentencePiece

Different standards for tokenization

Dive into Deep Learning

Character-Level Tokenization

Byte pair encoding is a greedy algorithm that performs statistical analysis on a training dataset to identify common symbols within words. It works by iteratively locating the most frequently occurring pair of consecutive symbols and merging them into a single, new symbol.

Learn Before

Related