Learn Before
Example of Word and Punctuation Tokenization
A fundamental method of tokenization involves segmenting a text into its constituent English words and punctuation marks. For example, the phrase 'I love the food here. It’s amazing' would be tokenized into the following sequence of units: {I, love, the, food, here, ., It, ’s, amazing}.
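The segmentation described above can be approximated with a single regular expression. This is a minimal sketch, not a standard tokenizer: the function name `tokenize` and the pattern are assumptions made for illustration. It treats an apostrophe followed by letters as a separate clitic token (such as ’s), which matches the example but is a simplification: it would also attach an opening single quote to the word after it, and it does not handle contractions such as n't the way some tokenization standards do.

```python
import re

# Hypothetical pattern for illustration (not taken from the source):
#   [’']\w+   an apostrophe (curly or straight) plus letters, e.g. ’s
#   \w+       a run of word characters, e.g. love
#   [^\w\s]   any single remaining non-space character, e.g. .
TOKEN_PATTERN = re.compile(r"[’']\w+|\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("I love the food here. It’s amazing"))
```

Whitespace matches none of the three alternatives, so it is skipped and acts only as a separator.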
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Different standards for tokenization
Inference Process with a Fine-Tuned Model
Example of Tokenization into Words and Punctuation
Example of Word and Punctuation Tokenization
Methods of Tokenization
A language model is given the sentence: 'The researcher is studying neuroplasticity.' It processes the sentence using two different methods, resulting in two different sequences of tokens.
Method A: ['The', 'researcher', 'is', 'studying', 'neuroplasticity', '.']
Method B: ['The', 'researcher', 'is', 'study', 'ing', 'neuro', 'plasticity', '.']
Assuming the model has never encountered the word 'neuroplasticity' during its training but has seen words like 'neuroscience' and 'plasticity' separately, which method is more advantageous for helping the model understand the new word, and why?
Tokenization Strategies
Evaluating Tokenization for a Specialized Chatbot
Learn After
A tokenization process is designed to segment text into individual English words and punctuation marks. For example, the phrase 'It’s great.' is tokenized into ['It', '’s', 'great', '.']. Based on this rule, how would the sentence 'The student's book isn't here.' be tokenized?
Applying Word and Punctuation Tokenization
Consider a tokenization method that segments text into individual English words and punctuation marks. For instance, 'It’s great.' becomes ['It', '’s', 'great', '.']. True or False: Following this method, the phrase 'We're going home.' would be tokenized as ['We', '’re', 'going', 'home.'].