Learn Before
Concept

Unknown Words and Problem of Sparsity

For any n-gram that occurred a sufficient number of times, we might have a good estimate of its probability. But because any corpus is limited, some perfectly acceptable English word sequences are bound to be missing from it. These zero-probability n-grams cause problems if they occur in the test set:

  • The model underestimates the probability of all sorts of word sequences that might actually occur.
  • If the probability of any word in the test set is 0, the probability of the entire test set is 0. Since perplexity is, by definition, based on the inverse probability of the test set, it cannot even be computed.
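The effect above can be sketched with a toy bigram model; the corpus and test sentences here are invented for illustration. A single unseen bigram in the test set drives its probability to 0, making the perplexity infinite:

```python
import math

# Toy training corpus (illustrative only).
corpus = "the cat sat on the mat".split()
bigram_counts = {}
unigram_counts = {}
for w1, w2 in zip(corpus, corpus[1:]):
    bigram_counts[(w1, w2)] = bigram_counts.get((w1, w2), 0) + 1
    unigram_counts[w1] = unigram_counts.get(w1, 0) + 1

def bigram_prob(w1, w2):
    # Maximum-likelihood estimate: count(w1 w2) / count(w1).
    if unigram_counts.get(w1, 0) == 0:
        return 0.0
    return bigram_counts.get((w1, w2), 0) / unigram_counts[w1]

def perplexity(test_words):
    # Perplexity is the inverse probability of the test set,
    # normalized by the number of bigrams scored.
    log_prob = 0.0
    n = 0
    for w1, w2 in zip(test_words, test_words[1:]):
        p = bigram_prob(w1, w2)
        if p == 0.0:
            # One zero-probability bigram zeroes out the whole
            # test-set probability, so perplexity is infinite.
            return float("inf")
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

print(perplexity("the cat sat".split()))  # all bigrams seen: finite
print(perplexity("the dog sat".split()))  # "the dog" unseen: inf
```

Smoothing techniques (e.g. Laplace or add-k smoothing) exist precisely to redistribute some probability mass to such unseen n-grams.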


Updated 2022-06-29

Tags

Data Science

Learn After