Learn Before
Tokenization
Tokenization is the process of splitting a sequence of text into smaller units, known as tokens. It is a foundational step in Natural Language Processing, and there are many different methods and strategies for tokenizing a text.
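One simple strategy splits text on word boundaries and punctuation. The sketch below uses a minimal regular expression for illustration; real tokenizers handle many more cases (contractions, Unicode, special tokens).

```python
import re

def tokenize(text):
    # Match either a run of word characters or a single
    # non-space, non-word character (i.e. punctuation).
    # A minimal illustration, not a production tokenizer.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The researcher is studying neuroplasticity."))
# → ['The', 'researcher', 'is', 'studying', 'neuroplasticity', '.']
```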
Tags
Data Science
Foundations of Large Language Models Course
Computing Sciences
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Related
Tokenization
Sentence segmentation
Word normalization
Unix Tools for Crude Tokenization and Normalization
Predictions with Sequences
Sequence Prediction Models
Sequence Classification Models
Recurrent Neural Network (RNN)
Sequence Model Question #1
Sequence Model Question #2
Sequence Model Question #4
Sequence Model Question #3
Tokenization
Notation for Source and Target Sequences
Learn After
Different standards for tokenization
Inference Process with a Fine-Tuned Model
Example of Tokenization into Words and Punctuation
Example of Word and Punctuation Tokenization
Methods of Tokenization
A language model is given the sentence: 'The researcher is studying neuroplasticity.' It processes the sentence using two different methods, resulting in two different sequences of tokens.
Method A:
['The', 'researcher', 'is', 'studying', 'neuroplasticity', '.']
Method B:
['The', 'researcher', 'is', 'study', 'ing', 'neuro', 'plasticity', '.']
Assuming the model has never encountered the word 'neuroplasticity' during its training but has seen words like 'neuroscience' and 'plasticity' separately, which method is more advantageous for helping the model understand the new word, and why?
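Method B's segmentation can be produced by a subword tokenizer with a fixed vocabulary. The sketch below uses greedy longest-match-first segmentation (a simplification of WordPiece-style tokenization; the vocabulary here is a hypothetical toy set chosen to reproduce Method B):

```python
def subword_tokenize(word, vocab):
    # Greedy longest-match-first segmentation: repeatedly take the
    # longest prefix of the remaining word that is in the vocabulary.
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return [word]  # no prefix matched: fall back to the whole word
        tokens.append(word[start:end])
        start = end
    return tokens

# Hypothetical vocabulary containing subwords seen during training
vocab = {"The", "researcher", "is", "study", "ing", "neuro", "plasticity", "."}
words = ["The", "researcher", "is", "studying", "neuroplasticity", "."]
print([t for w in words for t in subword_tokenize(w, vocab)])
# → ['The', 'researcher', 'is', 'study', 'ing', 'neuro', 'plasticity', '.']
```

Because 'neuroplasticity' is absent from the vocabulary, it is decomposed into known subwords ('neuro', 'plasticity') rather than treated as a single unknown token.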
Tokenization Strategies
Evaluating Tokenization for a Specialized Chatbot