There are two common ways to train the probabilities of the unknown word model:
- Choose a vocabulary (word list) that is fixed in advance. In a text normalization step, convert every OOV word in the training set to the unknown word token /<UNK>/, then estimate the probabilities for /<UNK>/ from its counts just like any other regular word in the training set.
- Create the vocabulary implicitly by replacing words in the training data with /<UNK>/ based on their frequency (for example, every word that occurs fewer than n times), then estimate the probabilities for /<UNK>/ as in the first approach.
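
The two approaches above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the toy training sentence, the threshold `min_count=2`, and the use of unigram (rather than higher-order n-gram) estimates are all assumptions made for the example.

```python
from collections import Counter

UNK = "<UNK>"  # the unknown word token


def normalize_with_fixed_vocab(tokens, vocab):
    """Approach 1: map every token outside a fixed vocabulary to <UNK>."""
    return [tok if tok in vocab else UNK for tok in tokens]


def normalize_by_frequency(tokens, min_count=2):
    """Approach 2: build the vocabulary implicitly from training counts.

    Words seen fewer than min_count times (an assumed threshold) become <UNK>.
    Returns the normalized tokens and the implied vocabulary.
    """
    counts = Counter(tokens)
    vocab = {tok for tok, c in counts.items() if c >= min_count}
    return normalize_with_fixed_vocab(tokens, vocab), vocab


def unigram_probs(tokens):
    """Maximum-likelihood unigram estimates; <UNK> is counted like any word."""
    counts = Counter(tokens)
    total = len(tokens)
    return {tok: c / total for tok, c in counts.items()}


# Toy training data (hypothetical).
train = "the cat sat on the mat the dog sat".split()

# Approach 1: vocabulary chosen in advance.
fixed = normalize_with_fixed_vocab(train, {"the", "cat", "sat", "on"})
probs_fixed = unigram_probs(fixed)

# Approach 2: vocabulary implied by word frequency.
implicit, vocab = normalize_by_frequency(train, min_count=2)
probs_implicit = unigram_probs(implicit)
```

In both cases /<UNK>/ ends up with ordinary counts, so its probability is estimated exactly like that of any other word. The choice of vocabulary or threshold changes how much probability mass /<UNK>/ receives.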

University of Michigan - Ann Arbor

For any n-gram that occurred a sufficient number of times, we might have a good estimate of its probability. But because any corpus is limited, some perfectly acceptable English word sequences are bound to be missing from it. These zero-probability n-grams cause two problems when they occur in the test set:
- The model underestimates the probability of all sorts of words that might occur.
- If the probability of any word in the test set is 0, the probability of the entire test set is 0. Because perplexity is defined in terms of the inverse probability of the test set, it cannot be computed in that case (it would require dividing by 0).
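
To make the second point concrete, here is a small sketch of perplexity computed from per-word probabilities. The `perplexity` helper and the probability values are hypothetical; returning infinity for a zero-probability word is a convention chosen for this example.

```python
import math


def perplexity(word_probs):
    """Perplexity of a test set given the model probability of each word.

    Computed as exp(-(1/N) * sum(log p_i)), which equals the N-th root of
    the inverse probability of the whole test set.
    """
    n = len(word_probs)
    if any(p == 0.0 for p in word_probs):
        # The test-set probability is 0, so its inverse is undefined;
        # we report infinity here by convention.
        return math.inf
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)


# Hypothetical per-word probabilities assigned by some model.
seen = [0.25, 0.5, 0.25, 0.5]
with_zero = [0.25, 0.5, 0.0, 0.5]  # one word was never seen in training

pp_seen = perplexity(seen)
pp_zero = perplexity(with_zero)
```

A single zero among otherwise reasonable probabilities is enough to make the result unusable, which is why unseen events need special handling.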

Unknown Words and Problem of Sparsity

A helpful book resource on NLP, still in draft:
https://web.stanford.edu/~jurafsky/slp3/

Speech and Language Processing (3rd ed. draft) 

A Closed Vocabulary system is one in which the test set can only contain words from this lexicon, and there will be no unknown words.

Closed Vocabulary

An Open Vocabulary system is one in which we model these potential unknown words in the test set by adding a pseudo-word called /<UNK>/.


Open Vocabulary

Out of Vocabulary (OOV) words are unknown words that do not occur in the training set. The percentage of OOV words that appear in the test set is called the OOV Rate.
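
As a rough sketch, the OOV rate can be computed as follows. The example assumes the rate is measured over test tokens (not distinct word types) and that the vocabulary is simply the set of training words; the function name and toy data are hypothetical.

```python
def oov_rate(train_tokens, test_tokens):
    """Percentage of test tokens that never occur in the training set."""
    vocab = set(train_tokens)
    if not test_tokens:
        return 0.0  # avoid dividing by zero on an empty test set
    oov = sum(1 for tok in test_tokens if tok not in vocab)
    return 100.0 * oov / len(test_tokens)


# Toy data (hypothetical).
train_tokens = "the cat sat on the mat".split()
test_tokens = "the dog sat on the rug".split()

# "dog" and "rug" are unseen, so 2 of the 6 test tokens are OOV.
rate = oov_rate(train_tokens, test_tokens)
```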

Out of Vocabulary (OOV)

Training Unknown Word Model

Smoothing (or discounting) is a modification that keeps a language model from assigning zero probability to unseen events by shaving off a bit of probability mass from some more frequent events and giving it to events that were never seen.
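
The passage does not name a particular smoothing method. As one simple instance, the sketch below uses add-one (Laplace) smoothing on unigram counts; the function name, the toy data, and the extra vocabulary word "dog" are assumptions for illustration.

```python
from collections import Counter


def laplace_unigram_probs(tokens, vocab):
    """Add-one smoothed unigram probabilities over a given vocabulary.

    Each word's count is increased by 1, and the denominator grows by the
    vocabulary size so that the probabilities still sum to 1.
    """
    counts = Counter(tokens)
    total = len(tokens)
    v = len(vocab)
    return {w: (counts[w] + 1) / (total + v) for w in vocab}


# Toy data (hypothetical); "dog" is in the vocabulary but never observed.
tokens = "the cat sat on the mat".split()
vocab = set(tokens) | {"dog"}

smoothed = laplace_unigram_probs(tokens, vocab)
```

Here the unseen word "dog" receives a small non-zero probability, while the frequent word "the" drops below its unsmoothed estimate of 2/6. This is the shift of probability mass that the definition describes.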
