Learn Before
Methods of Tokenization
Tokenization, the process of breaking down text into smaller units called tokens, can be performed using various strategies. A fundamental and straightforward method involves segmenting the text based on its constituent words and punctuation marks.
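The word-and-punctuation strategy described above can be sketched in a few lines of Python. This is an illustrative sketch, not a production tokenizer; the function name `tokenize` and the regular expression are our own choices, using a pattern that matches either a run of word characters or a single punctuation mark.

```python
import re

def tokenize(text):
    # Match either a run of word characters (a word) or any single
    # non-word, non-space character (a punctuation mark).
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```

Real tokenizers handle many more edge cases (contractions, numbers, Unicode), but this captures the core idea of segmenting on words and punctuation.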
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Different standards for tokenization
Inference Process with a Fine-Tuned Model
Example of Tokenization into Words and Punctuation
Methods of Tokenization
A language model is given the sentence: 'The researcher is studying neuroplasticity.' It processes the sentence using two different methods, resulting in two different sequences of tokens.
Method A:
['The', 'researcher', 'is', 'studying', 'neuroplasticity', '.']

Method B:

['The', 'researcher', 'is', 'study', 'ing', 'neuro', 'plasticity', '.']

Assuming the model has never encountered the word 'neuroplasticity' during its training but has seen words like 'neuroscience' and 'plasticity' separately, which method is more advantageous for helping the model understand the new word, and why?
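Method B's behavior can be illustrated with a toy greedy longest-match subword tokenizer. This is a simplified sketch with an assumed toy vocabulary, not how any particular production tokenizer (e.g. BPE or WordPiece) is actually trained; it only shows how an unseen word can be decomposed into familiar pieces.

```python
def greedy_subword(word, vocab):
    # Greedily take the longest vocabulary entry matching from the left;
    # fall back to a single character when nothing in the vocabulary matches.
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])
            i += 1
    return pieces

# Hypothetical vocabulary learned from training text.
vocab = {"the", "researcher", "is", "study", "ing", "neuro", "plasticity"}

print(greedy_subword("neuroplasticity", vocab))  # ['neuro', 'plasticity']
print(greedy_subword("studying", vocab))         # ['study', 'ing']
```

Because 'neuroplasticity' decomposes into the known pieces 'neuro' and 'plasticity', the model can relate the new word to meanings it has already learned, which is the advantage of Method B.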
Tokenization Strategies
Evaluating Tokenization for a Specialized Chatbot
Learn After
A common approach to breaking down text into smaller units involves two steps: first, splitting the text by spaces, and second, separating any punctuation that is attached to the beginning or end of the resulting pieces. Based on this method, how would the sentence "Let's go to the park!" be broken down?
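The two-step method in the question above can be sketched directly: split on spaces, then peel punctuation off the start and end of each piece. The helper name `split_and_strip` is our own; note that internal punctuation, such as the apostrophe in "Let's", is never reached by the start/end peeling and stays attached.

```python
import string

def split_and_strip(text):
    # Step 1: split the text on whitespace.
    # Step 2: separate punctuation attached to the beginning or end of
    # each piece into its own tokens (internal punctuation stays put).
    tokens = []
    for piece in text.split():
        leading = []
        while piece and piece[0] in string.punctuation:
            leading.append(piece[0])
            piece = piece[1:]
        trailing = []
        while piece and piece[-1] in string.punctuation:
            trailing.append(piece[-1])
            piece = piece[:-1]
        tokens.extend(leading)
        if piece:
            tokens.append(piece)
        tokens.extend(reversed(trailing))
    return tokens

print(split_and_strip("Let's go to the park!"))
# ["Let's", 'go', 'to', 'the', 'park', '!']
```

The apostrophe in "Let's" survives because only leading and trailing punctuation is stripped, while the sentence-final '!' becomes its own token.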
Inferring Tokenization Rules
When using a word-and-punctuation-based method to break down text, splitting at every space character alone is not sufficient: punctuation attached to the beginning or end of a word must also be separated into its own tokens.