Learn Before
Distinguishing Words from Tokens
Consider the sentence: "The cat's toy isn't here." First, count the number of words in the sentence. Then, determine how many tokens would be generated if a tokenizer follows these two rules:
- It separates punctuation from words (e.g., "here." becomes "here" and ".").
- It splits common contractions and possessives (e.g., "cat's" becomes "cat" and "'s"; "isn't" becomes "is" and "n't").
Finally, explain why the word count and the token count are different.
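The two rules can be checked mechanically. The following is a minimal sketch, not a standard tokenizer: the regular expression and the names TOKEN_PATTERN, count_words, and tokenize are assumptions chosen for illustration, and "word" is taken to mean a whitespace-separated chunk.

```python
import re

# Illustrative pattern (an assumption, not a standard tokenizer):
#   \w+(?=n't)  word stem that comes right before the suffix n't
#   n't         the contraction suffix itself
#   '\w+        possessive or clitic suffix such as 's
#   \w+         any other run of word characters
#   [^\w\s]     a single punctuation mark
TOKEN_PATTERN = re.compile(r"\w+(?=n't)|n't|'\w+|\w+|[^\w\s]")


def count_words(text: str) -> int:
    """Count whitespace-separated chunks, the usual informal notion of a word."""
    return len(text.split())


def tokenize(text: str) -> list[str]:
    """Apply both rules: detach punctuation, split contractions and possessives."""
    return TOKEN_PATTERN.findall(text)


sentence = "The cat's toy isn't here."
tokens = tokenize(sentence)
print(count_words(sentence), "words")
print(len(tokens), "tokens:", tokens)
```

Running the sketch on the sentence prints both counts side by side, which makes it easy to see which rule produced each extra token.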
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Consider the sentence: "The model's performance isn't great." This sentence is processed using two different methods for breaking down text into basic units (tokens), resulting in the following outputs:
- Method A: ['The', 'model', "'s", 'performance', 'is', "n't", 'great', '.']
- Method B: ['The', "model's", 'performance', "isn't", 'great', '.']
By analyzing the differences between these two lists of tokens, what can be inferred about the underlying rules of each method?
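One way to test a hypothesis about the rules is to write candidate rule sets and compare their output with the two lists. The sketch below is only one possible pair of rule sets that reproduces the outputs for this sentence; the patterns and the names method_a and method_b are assumptions, and other rules could yield the same lists.

```python
import re

# Assumed rule set for Method A: detach punctuation and also split
# contraction and possessive suffixes (n't, 's) from their stems.
PATTERN_A = re.compile(r"\w+(?=n't)|n't|'\w+|\w+|[^\w\s]")

# Assumed rule set for Method B: detach punctuation, but keep a word
# together with an apostrophe suffix attached to it.
PATTERN_B = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def method_a(text: str) -> list[str]:
    return PATTERN_A.findall(text)


def method_b(text: str) -> list[str]:
    return PATTERN_B.findall(text)


sentence = "The model's performance isn't great."
print("Method A:", method_a(sentence))
print("Method B:", method_b(sentence))
```

Comparing the printed lists with the ones given in the question shows whether the assumed rules are consistent with the observed behavior.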
Distinguishing Words from Tokens
A programmer is using a specific method to break down the sentence "Let's re-evaluate the model's performance." into a list of basic units. The method's rules are: 1) Split the text by spaces, and 2) Treat each punctuation mark (such as "-", "'", and ".") as a separate unit. Which of the following outputs correctly applies these rules?
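A candidate answer can be verified by applying the two rules in code. This is a minimal sketch under the assumption that "punctuation mark" means any single character that is neither a word character nor whitespace; the function name split_spaces_then_punctuation is hypothetical.

```python
import re


def split_spaces_then_punctuation(text: str) -> list[str]:
    """Rule 1: split on spaces. Rule 2: make every punctuation mark its own unit."""
    units = []
    for chunk in text.split():
        # Within each chunk, keep runs of word characters together and
        # emit each remaining non-space character as a separate unit.
        units.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return units


sentence = "Let's re-evaluate the model's performance."
print(split_spaces_then_punctuation(sentence))
```

Because every punctuation mark becomes its own unit, the apostrophe and the hyphen each appear as standalone entries rather than staying attached to a neighboring word.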