Learn Before
Unknown Words and Problem of Sparsity
For any n-gram that occurred a sufficient number of times, we might have a good estimate of its probability. But because any corpus is limited, some perfectly acceptable English word sequences are bound to be missing from it. These zero-probability n-grams raise problems if they occur in the test set:
- The model underestimates the probability of all sorts of word sequences that might actually occur.
- If any word in the test set has probability 0, the probability of the entire test set is 0. Since perplexity is by definition based on the inverse probability of the test set, it cannot even be computed in that case, because we cannot divide by 0 (see the sketch below).
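To make the second point concrete, here is a minimal sketch, assuming a hypothetical toy corpus and an unsmoothed maximum-likelihood bigram model (the corpus, test sentence, and `bigram_prob` helper are illustrative, not from this card). A single bigram that never appeared in training, "dog sat", gives the whole test sentence probability 0, so its perplexity cannot be computed:

```python
from collections import Counter
import math

# Hypothetical toy corpus and test sentence (illustrative only, not from the card).
train = "the cat sat on the mat . the dog ran in the park .".split()
test = "the dog sat on the mat .".split()

# Unsmoothed maximum-likelihood bigram model: P(w2 | w1) = count(w1 w2) / count(w1).
unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))

def bigram_prob(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

# Log probability of the test sentence; a single unseen bigram sends it to -infinity.
log_prob = 0.0
for w1, w2 in zip(test, test[1:]):
    p = bigram_prob(w1, w2)
    print(f"P({w2} | {w1}) = {p:.2f}")
    log_prob += math.log(p) if p > 0 else float("-inf")

# Perplexity is the inverse probability of the test set, normalized by its length:
# PP(W) = P(w1 ... wN) ** (-1/N) = exp(-(1/N) * log P(w1 ... wN)).
n = len(test) - 1  # number of bigrams scored
if math.isinf(log_prob):
    print("Test-set probability is 0, so perplexity is undefined (division by zero).")
else:
    print("Perplexity:", math.exp(-log_prob / n))
```

In practice, this is why smoothing techniques (such as Laplace/add-one smoothing) or an unknown-word token are used, so that no n-gram in the test set ends up with probability 0.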
Tags
Data Science
Related
Huge Language Models
N-Gram Representation
Bigram Model
N-Gram Model
Sentence Generation from Unigram Model
Unknown Words and Problem of Sparsity
Historical Significance and Applications of N-gram Models
A statistical language model is built to predict the next word in a sentence based on the probability of it occurring after the preceding sequence of words. This model is trained exclusively on a massive corpus of texts written in the 19th century. When this model is prompted with the partial sentence, 'To save the file, the user clicked the...', which outcome is the most probable explanation for its behavior?
Curse of Dimensionality in Traditional Language Models
Analyzing Zero Probability in an N-gram Model
Evaluating N-gram Model Complexity