Learn Before
Tokenization
Tokenization is the process of splitting a sequence of text into smaller units, known as tokens. It is a foundational step in Natural Language Processing, and there are many different methods and strategies for tokenizing a text.
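One simple strategy splits text on word boundaries and punctuation. The sketch below uses a minimal regular expression for illustration; real tokenizers handle many more cases (contractions, Unicode, special tokens).

```python
import re

def tokenize(text):
    # Match either a run of word characters or a single
    # non-space, non-word character (i.e. punctuation).
    # A minimal illustration, not a production tokenizer.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The researcher is studying neuroplasticity."))
# → ['The', 'researcher', 'is', 'studying', 'neuroplasticity', '.']
```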
Tags
Data Science
Foundations of Large Language Models Course
Computing Sciences
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Related
Tokenization
Sentence segmentation
Word normalization
Unix Tools for Crude Tokenization and Normalization
Predictions with Sequences
Sequence Prediction Models
Sequence Classification Models
Recurrent Neural Network (RNN)
Sequence Model Question #1
Sequence Model Question #2
Sequence Model Question #4
Sequence Model Question #3
Tokenization
Notation for Source and Target Sequences
Learn After
Different standards for tokenization
Inference Process with a Fine-Tuned Model
Example of Tokenization into Words and Punctuation
Example of Word and Punctuation Tokenization
Methods of Tokenization
A language model is given the sentence: 'The researcher is studying neuroplasticity.' It processes the sentence using two different methods, resulting in two different sequences of tokens.
Method A:
['The', 'researcher', 'is', 'studying', 'neuroplasticity', '.']
Method B:
['The', 'researcher', 'is', 'study', 'ing', 'neuro', 'plasticity', '.']
Assuming the model has never encountered the word 'neuroplasticity' during its training but has seen words like 'neuroscience' and 'plasticity' separately, which method is more advantageous for helping the model understand the new word, and why?
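Method B's segmentation can be produced by a subword tokenizer with a fixed vocabulary. The sketch below uses greedy longest-match-first segmentation (a simplification of WordPiece-style tokenization; the vocabulary here is a hypothetical toy set chosen to reproduce Method B):

```python
def subword_tokenize(word, vocab):
    # Greedy longest-match-first segmentation: repeatedly take the
    # longest prefix of the remaining word that is in the vocabulary.
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return [word]  # no prefix matched: fall back to the whole word
        tokens.append(word[start:end])
        start = end
    return tokens

# Hypothetical vocabulary containing subwords seen during training
vocab = {"The", "researcher", "is", "study", "ing", "neuro", "plasticity", "."}
words = ["The", "researcher", "is", "studying", "neuroplasticity", "."]
print([t for w in words for t in subword_tokenize(w, vocab)])
# → ['The', 'researcher', 'is', 'study', 'ing', 'neuro', 'plasticity', '.']
```

Because 'neuroplasticity' is absent from the vocabulary, it is decomposed into known subwords ('neuro', 'plasticity') rather than treated as a single unknown token.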
Tokenization Strategies
Evaluating Tokenization for a Specialized Chatbot