Learn Before
Inferring Tokenization Rules
A tokenizer processes the sentence "The U.S.A. is a country. It's great!" and produces the following list of tokens: ['The', 'U.S.A.', 'is', 'a', 'country', '.', "It's", 'great', '!']. Based on this output, describe two specific rules the tokenizer likely followed to separate the original text.
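One plausible implementation consistent with this output is sketched below. It is a guess at the tokenizer's behavior, not the actual tool: split on whitespace, keep abbreviations with internal periods (like "U.S.A.") and contractions (like "It's") intact, and detach trailing sentence punctuation into its own token. The function name `tokenize` is a hypothetical choice.

```python
import re

def tokenize(text):
    """Sketch of rules inferred from the output: whitespace split, then
    peel trailing punctuation -- except on abbreviations like 'U.S.A.'."""
    tokens = []
    for piece in text.split():
        # Rule 1 exception: tokens of the form X.Y.Z. keep their periods.
        if re.fullmatch(r"(?:[A-Za-z]\.){2,}", piece):
            tokens.append(piece)
            continue
        # Rule 2: trailing punctuation becomes separate tokens;
        # internal apostrophes (contractions) are left alone.
        word, punct = re.fullmatch(r"(.*?)([.!?,;:]*)", piece).groups()
        if word:
            tokens.append(word)
        tokens.extend(punct)
    return tokens

print(tokenize("The U.S.A. is a country. It's great!"))
# ['The', 'U.S.A.', 'is', 'a', 'country', '.', "It's", 'great', '!']
```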
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A common approach to breaking down text into smaller units involves two steps: first, splitting the text by spaces, and second, separating any punctuation that is attached to the beginning or end of the resulting pieces. Based on this method, how would the sentence "Let's go to the park!" be broken down?
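The two-step method described above can be sketched as follows; this is an illustrative implementation under the stated assumptions (apostrophes inside words are not treated as punctuation, so contractions survive intact), and the name `tokenize` is hypothetical.

```python
DETACH = "!?.,;:\"()"  # punctuation to separate; apostrophes stay attached

def tokenize(text):
    tokens = []
    for piece in text.split():  # step 1: split on spaces
        # step 2: peel punctuation from the start and end of each piece
        leading = []
        while piece and piece[0] in DETACH:
            leading.append(piece[0])
            piece = piece[1:]
        trailing = []
        while piece and piece[-1] in DETACH:
            trailing.append(piece[-1])
            piece = piece[:-1]
        tokens.extend(leading)
        if piece:
            tokens.append(piece)
        tokens.extend(reversed(trailing))
    return tokens

print(tokenize("Let's go to the park!"))
# ["Let's", 'go', 'to', 'the', 'park', '!']
```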
Inferring Tokenization Rules
When using a word-and-punctuation-based method to break down text, the text is always segmented into tokens simply by splitting it at every space character.