Calculated by dividing the observed frequency of a particular sequence by the observed frequency of its prefix. Using it to estimate probabilities is an example of maximum likelihood estimation (MLE).

Relative frequency

In computational models (such as n-gram language models), multiplying many small probabilities can cause arithmetic underflow, where the product becomes too small to be represented by standard floating-point numbers and rounds to zero. To prevent this, computations are performed in log space. Transforming the multiplication of probabilities into the addition of their logarithms maintains numerical stability:

$$p_1 \times p_2 \times p_3 \times p_4 = \exp(\log{p_1} + \log{p_2} + \log{p_3} + \log{p_4})$$

Values only need to be converted back to raw probabilities (using the exponential function) at the very end of the process, if necessary.
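As a minimal sketch of the log-space trick, the Python snippet below compares a direct product with a sum of logarithms. The helper name `log_product` and the sample probabilities are illustrative assumptions, not part of any particular library; natural logarithms and standard 64-bit floats are assumed.

```python
import math

def log_product(probs):
    """Return log(p_1 * p_2 * ... * p_k) as a sum of logs (each p must be > 0)."""
    return sum(math.log(p) for p in probs)

# Small case: the log-space result converts back to the ordinary product.
small = [0.1, 0.2, 0.3, 0.4]
recovered = math.exp(log_product(small))   # approximately 0.0024

# Long case (hypothetical values): 400 factors of 0.001.
# The true product, 1e-1200, is far below the smallest positive float64,
# so direct multiplication underflows to zero.
many = [1e-3] * 400
naive = math.prod(many)
log_total = log_product(many)              # a finite negative number

print(naive)
print(log_total)
```

The log-space total stays representable, so two long sequences can still be compared by their log probabilities without ever calling `math.exp`.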

Log Probabilities

To estimate the parameters of an n-gram model, we can use Maximum Likelihood Estimation (MLE). This involves getting counts from a training corpus and normalizing them so that they lie between 0 and 1. The MLE for an n-gram calculates the conditional probability of the next word by taking its relative frequency:

$$P(w_n|w_{n-N+1:n-1}) = \frac{C(w_{n-N+1:n-1} w_n)}{C(w_{n-N+1:n-1})}$$
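The count-and-normalize step can be sketched in Python as follows. The function name `mle_ngram_probs`, the toy corpus, and the `<s>`/`</s>` boundary markers are assumptions chosen for illustration; the corpus is treated as one flat token list, and a context is counted only where a word follows it, so the estimates for each context sum to 1.

```python
from collections import Counter

def mle_ngram_probs(tokens, n):
    """Return {(context, word): C(context + word) / C(context)} for an n-gram model."""
    ngram_counts = Counter()
    context_counts = Counter()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        ngram_counts[ngram] += 1
        # Count the (n-1)-word prefix only when it is followed by a word.
        context_counts[ngram[:-1]] += 1
    return {
        (ngram[:-1], ngram[-1]): count / context_counts[ngram[:-1]]
        for ngram, count in ngram_counts.items()
    }

# Hypothetical toy corpus with sentence-boundary markers.
corpus = (
    "<s> I am Sam </s> "
    "<s> Sam I am </s> "
    "<s> I do not like green eggs and ham </s>"
).split()

bigram = mle_ngram_probs(corpus, n=2)
print(bigram[(("<s>",), "I")])   # C(<s> I) / C(<s>) = 2 / 3
print(bigram[(("I",), "am")])    # C(I am) / C(I)   = 2 / 3
print(bigram[(("am",), "Sam")])  # C(am Sam) / C(am) = 1 / 2
```

Any n-gram that never occurs in the corpus is simply absent from the dictionary, which corresponds to an MLE estimate of zero.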

The general equations of an n-gram model apply the Markov assumption to estimate word probabilities. The conditional probability for the next word is approximated by looking $$N-1$$ words into the past:

$$P(w_n|w_{1:n-1}) \approx P(w_n|w_{n-N+1:n-1})$$

The probability of a complete word sequence is approximated as the product of these conditional probabilities:

$$P(w_{1:n}) \approx \prod^n_{k=1}P(w_k|w_{k-N+1:k-1})$$
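The product over conditional probabilities can be sketched in Python as below. Everything here is a hypothetical illustration: the function name `ngram_sequence_log_prob`, the probability table `toy_bigram`, and the choice to pad the start of the sequence with $$N-1$$ copies of `<s>` so that every word has a full context. The sum is accumulated over logarithms and exponentiated only at the end.

```python
import math

def ngram_sequence_log_prob(words, cond_prob, n, bos="<s>"):
    """Sum log P(w_k | w_{k-N+1:k-1}) over the sequence.

    cond_prob maps (context_tuple, word) to a probability.
    The sequence is padded with n-1 start markers (an assumption).
    """
    padded = [bos] * (n - 1) + list(words)
    total = 0.0
    for k in range(n - 1, len(padded)):
        context = tuple(padded[k - n + 1:k])   # the previous n-1 words
        total += math.log(cond_prob[(context, padded[k])])
    return total

# Hypothetical bigram probabilities, chosen only for the example.
toy_bigram = {
    (("<s>",), "I"): 0.5,
    (("I",), "am"): 0.4,
    (("am",), "Sam"): 0.25,
}

log_p = ngram_sequence_log_prob(["I", "am", "Sam"], toy_bigram, n=2)
print(math.exp(log_p))   # approximately 0.5 * 0.4 * 0.25 = 0.05
```

A lookup of an unseen (context, word) pair raises `KeyError` in this sketch; a real model would need some policy for unseen n-grams.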

General Equations of an N-Gram Model

A helpful book resource on NLP that is still being written:
https://web.stanford.edu/~jurafsky/slp3/
