There are two common ways to train the probabilities of the unknown word model <UNK>. The first is to turn the problem back into a closed-vocabulary one by choosing a fixed vocabulary in advance (a short sketch of these steps follows the list):

1. Choose a vocabulary (word list) that is fixed in advance.
2. Convert in the training set any word that is not in this set (any OOV word) to the unknown word token <UNK> in a text normalization step.
3. Estimate the probabilities for <UNK> from its counts just like any other regular word in the training set.
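A minimal sketch of these three steps in Python, assuming a toy vocabulary and a unigram count-based estimate (the names and data here are illustrative, not from the text):

```python
from collections import Counter

UNK = "<UNK>"

# Step 1: a vocabulary (word list) fixed in advance (toy example).
vocabulary = {"the", "cat", "sat", "on", "mat"}

def normalize_oov(tokens, vocab):
    """Step 2: replace any OOV token with <UNK> during text normalization."""
    return [tok if tok in vocab else UNK for tok in tokens]

training = "the cat sat on the aardvark".split()
normalized = normalize_oov(training, vocabulary)
# -> ['the', 'cat', 'sat', 'on', 'the', '<UNK>']

# Step 3: estimate probabilities for <UNK> from its counts just like
# any other word (here, a unigram maximum likelihood estimate).
counts = Counter(normalized)
p_unk = counts[UNK] / len(normalized)   # 1/6 in this toy corpus
```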
The second alternative, in situations where we don't have a vocabulary fixed in advance, is to create such a vocabulary implicitly, replacing words in the training data by <UNK> based on their frequency.
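One common way to realize this, sketched below, is to replace every word that occurs fewer than some threshold number of times in the training data with <UNK>; the threshold `min_count` is an illustrative parameter, not a value fixed by the text:

```python
from collections import Counter

UNK = "<UNK>"

def replace_rare(tokens, min_count=2):
    """Implicitly build the vocabulary: any word occurring fewer than
    min_count times in the training data is replaced with <UNK>."""
    counts = Counter(tokens)
    return [tok if counts[tok] >= min_count else UNK for tok in tokens]

training = "the cat sat on the mat the cat ran".split()
print(replace_rare(training))
# -> ['the', 'cat', '<UNK>', '<UNK>', 'the', '<UNK>', 'the', 'cat', '<UNK>']
```

After this replacement, the surviving word types form the vocabulary, and the probability of <UNK> is again estimated from its counts like any other word.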