Learn Before
Methods of Tokenization
Tokenization, the process of breaking down text into smaller units called tokens, can be performed using various strategies. A fundamental and straightforward method involves segmenting the text based on its constituent words and punctuation marks.
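The word-and-punctuation strategy described above can be sketched in a few lines of Python. This is an illustrative sketch, not a production tokenizer; the function name `tokenize` and the regular expression are our own choices, using a pattern that matches either a run of word characters or a single punctuation mark.

```python
import re

def tokenize(text):
    # Match either a run of word characters (a word) or any single
    # non-word, non-space character (a punctuation mark).
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```

Real tokenizers handle many more edge cases (contractions, numbers, Unicode), but this captures the core idea of segmenting on words and punctuation.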
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Different standards for tokenization
Inference Process with a Fine-Tuned Model
Example of Tokenization into Words and Punctuation
Methods of Tokenization
A language model is given the sentence: 'The researcher is studying neuroplasticity.' It processes the sentence using two different methods, resulting in two different sequences of tokens.
Method A:
['The', 'researcher', 'is', 'studying', 'neuroplasticity', '.']

Method B:

['The', 'researcher', 'is', 'study', 'ing', 'neuro', 'plasticity', '.']

Assuming the model has never encountered the word 'neuroplasticity' during its training but has seen words like 'neuroscience' and 'plasticity' separately, which method is more advantageous for helping the model understand the new word, and why?
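Method B's behavior can be illustrated with a toy greedy longest-match subword tokenizer. This is a simplified sketch with an assumed toy vocabulary, not how any particular production tokenizer (e.g. BPE or WordPiece) is actually trained; it only shows how an unseen word can be decomposed into familiar pieces.

```python
def greedy_subword(word, vocab):
    # Greedily take the longest vocabulary entry matching from the left;
    # fall back to a single character when nothing in the vocabulary matches.
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])
            i += 1
    return pieces

# Hypothetical vocabulary learned from training text.
vocab = {"the", "researcher", "is", "study", "ing", "neuro", "plasticity"}

print(greedy_subword("neuroplasticity", vocab))  # ['neuro', 'plasticity']
print(greedy_subword("studying", vocab))         # ['study', 'ing']
```

Because 'neuroplasticity' decomposes into the known pieces 'neuro' and 'plasticity', the model can relate the new word to meanings it has already learned, which is the advantage of Method B.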
Tokenization Strategies
Evaluating Tokenization for a Specialized Chatbot
Learn After
A common approach to breaking down text into smaller units involves two steps: first, splitting the text by spaces, and second, separating any punctuation that is attached to the beginning or end of the resulting pieces. Based on this method, how would the sentence "Let's go to the park!" be broken down?
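The two-step method in the question above can be sketched directly: split on spaces, then peel punctuation off the start and end of each piece. The helper name `split_and_strip` is our own; note that internal punctuation, such as the apostrophe in "Let's", is never reached by the start/end peeling and stays attached.

```python
import string

def split_and_strip(text):
    # Step 1: split the text on whitespace.
    # Step 2: separate punctuation attached to the beginning or end of
    # each piece into its own tokens (internal punctuation stays put).
    tokens = []
    for piece in text.split():
        leading = []
        while piece and piece[0] in string.punctuation:
            leading.append(piece[0])
            piece = piece[1:]
        trailing = []
        while piece and piece[-1] in string.punctuation:
            trailing.append(piece[-1])
            piece = piece[:-1]
        tokens.extend(leading)
        if piece:
            tokens.append(piece)
        tokens.extend(reversed(trailing))
    return tokens

print(split_and_strip("Let's go to the park!"))
# ["Let's", 'go', 'to', 'the', 'park', '!']
```

The apostrophe in "Let's" survives because only leading and trailing punctuation is stripped, while the sentence-final '!' becomes its own token.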
Inferring Tokenization Rules
When using a word-and-punctuation-based method to break down text, splitting at every space character alone is not sufficient: punctuation attached to the beginning or end of a word must also be separated into its own tokens.