Learn Before
Evaluating N-gram Model Complexity
A data scientist is building a language model for a new, specialized domain with a limited amount of text data. They are deciding between using a bigram model (where the probability of a word depends on the single preceding word) and a 5-gram model (where the probability of a word depends on the four preceding words). Evaluate the trade-offs of each choice for this specific scenario. Which model would you recommend and why?
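The core tension in this question is data sparsity: with limited text, most 5-grams will occur once or never, so their maximum-likelihood counts are unreliable, while bigram counts are far denser. A minimal sketch (with a made-up toy corpus; real domain data would be larger, but the trend is the same) that counts distinct n-grams and how many were seen only once:

```python
from collections import Counter

def ngram_coverage(tokens, n):
    """Return (distinct n-grams, n-grams seen exactly once) for a token list."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    singletons = sum(1 for count in grams.values() if count == 1)
    return len(grams), singletons

# Toy stand-in corpus (19 tokens) to illustrate the sparsity trend.
corpus = ("the user clicked the save button then the user "
          "clicked the close button and the user saved the file").split()

for n in (2, 5):
    distinct, singletons = ngram_coverage(corpus, n)
    print(f"{n}-grams: {distinct} distinct, {singletons} seen only once")
```

Here every 5-gram is a singleton, while several bigrams repeat; on a small specialized corpus the same pattern means a 5-gram model's probability estimates rest mostly on counts of 1 or 0, which is why the bigram model (possibly with smoothing) is usually the safer recommendation in this scenario.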
Tags
Data Science
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Huge Language Models
N-Gram Representation
Bigram Model
N-Gram Model
Sentence Generation from Unigram Model
Unknown Words and Problem of Sparsity
Historical Significance and Applications of N-gram Models
A statistical language model is built to predict the next word in a sentence based on the probability of each candidate word given the preceding sequence of words. This model is trained exclusively on a massive corpus of texts written in the 19th century. When this model is prompted with the partial sentence, 'To save the file, the user clicked the...', which outcome is the most probable explanation for its behavior?
Curse of Dimensionality in Traditional Language Models
Analyzing Zero Probability in an N-gram Model
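The 19th-century-corpus question above hinges on the same sparsity problem: a count-based model assigns zero probability to any continuation it has never observed in training. A minimal sketch of a maximum-likelihood bigram estimate, using a hypothetical training snippet with no computing vocabulary:

```python
from collections import Counter

def bigram_prob(tokens, w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1) from raw counts."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    if contexts[w1] == 0:
        return 0.0  # context never seen: probability is undefined/zero under MLE
    return bigrams[(w1, w2)] / contexts[w1]

# Hypothetical 19th-century-style training text: "clicked" never precedes
# computing terms like "save" or "button".
corpus = "the gentleman clicked his heels and the gentleman bowed".split()

print(bigram_prob(corpus, "clicked", "his"))   # seen continuation
print(bigram_prob(corpus, "clicked", "save"))  # unseen continuation -> 0.0
```

Without smoothing, any word the corpus has never seen after 'clicked' gets probability zero, which is the behavior the question asks you to explain.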