Learn Before
When using a word-and-punctuation-based method to break down text, the text is always segmented into tokens simply by splitting it at every space character.
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Comprehension in Revised Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A common approach to breaking down text into smaller units involves two steps: first, splitting the text by spaces, and second, separating any punctuation that is attached to the beginning or end of the resulting pieces. Based on this method, how would the sentence "Let's go to the park!" be broken down?
Inferring Tokenization Rules
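The two-step method described in the related question above (split the text by spaces, then separate punctuation attached to the beginning or end of each piece) can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `word_punct_tokenize` is hypothetical, and treating Python's `string.punctuation` (ASCII punctuation only) as the punctuation set is an assumption.

```python
import string

def word_punct_tokenize(text):
    """Hypothetical sketch: split on whitespace, then detach punctuation
    found at the start or end of each piece (assumes ASCII punctuation)."""
    tokens = []
    for piece in text.split():
        leading = []
        # Step 2a: peel punctuation characters off the front of the piece.
        while piece and piece[0] in string.punctuation:
            leading.append(piece[0])
            piece = piece[1:]
        trailing = []
        # Step 2b: peel punctuation characters off the end, keeping their order.
        while piece and piece[-1] in string.punctuation:
            trailing.insert(0, piece[-1])
            piece = piece[:-1]
        tokens.extend(leading)
        if piece:  # punctuation inside the piece (e.g. an apostrophe) stays put
            tokens.append(piece)
        tokens.extend(trailing)
    return tokens

sentence = "Hello, world! It's (almost) done."
print(sentence.split())
# ['Hello,', 'world!', "It's", '(almost)', 'done.']
print(word_punct_tokenize(sentence))
# ['Hello', ',', 'world', '!', "It's", '(', 'almost', ')', 'done', '.']
```

Comparing the two printed lists shows how the second step changes the result: splitting on spaces alone leaves punctuation glued to the neighbouring word, while the second pass turns each leading or trailing punctuation mark into its own token and leaves word-internal marks untouched.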