conditional probability for the next word: $P(w_n|w_{1:n-1}) \approx P(w_n|w_{n-N+1:n-1})$

probability of a complete word sequence: $P(w_{1:n}) \approx \prod_{k=1}^{n} P(w_k|w_{k-N+1:k-1})$
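
As a rough illustration of the product above, the sketch below scores a short sequence by summing log conditional probabilities, which is equivalent to multiplying the probabilities but avoids numerical underflow. The function name `sequence_log_prob`, the `cond_prob` table layout, and the toy probabilities are all assumptions made for this example, not values from a real corpus.

```python
import math

def sequence_log_prob(tokens, cond_prob, order=2):
    """Log-probability of `tokens` under an n-gram approximation.

    `cond_prob` is a hypothetical lookup table mapping
    (context_tuple, word) -> P(word | context), where the context holds
    at most `order - 1` preceding tokens.
    """
    total = 0.0
    for k in range(len(tokens)):
        # Keep only the last (order - 1) tokens before position k.
        context = tuple(tokens[max(0, k - order + 1):k])
        total += math.log(cond_prob[(context, tokens[k])])
    return total

# Toy bigram table (made-up numbers, for illustration only).
toy_probs = {
    ((), "<s>"): 1.0,
    (("<s>",), "i"): 0.5,
    (("i",), "am"): 1.0,
}
log_p = sequence_log_prob(["<s>", "i", "am"], toy_probs, order=2)
sequence_prob = math.exp(log_p)  # product of the three conditional probabilities
```

Working in log space is a common design choice here: products of many small probabilities quickly become too small to represent as floats.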


University of Illinois at Urbana-Champaign

An n-gram is a sequence of n tokens. Models based on n-grams are straightforward to train and have been a building block of statistical language modeling for many years. Small values of n have specific names: unigram for n = 1, bigram for n = 2, and trigram for n = 3.

n-grams

A helpful book on NLP that is still a work in progress:
https://web.stanford.edu/~jurafsky/slp3/

Speech and Language Processing (3rd ed. draft) 

The assumption that the probability of a word depends only on the previous word is called a Markov assumption.

Markov used in NLP

To estimate the n-gram probabilities, we use maximum likelihood estimation (MLE): we get the MLE estimate for the parameters of an n-gram model by getting counts from a corpus and normalizing the counts so that they lie between 0 and 1:
$P(w_n|w_{n-N+1:n-1}) = \frac{C(w_{n-N+1:n-1}\, w_n)}{C(w_{n-N+1:n-1})}$
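
The count-and-normalize step can be sketched for the bigram case (N = 2) as follows. This is a minimal example under stated assumptions: the helper name `bigram_mle` and the toy corpus are hypothetical, and the corpus is treated as one flat token list, so a bigram spanning the sentence boundary (`</s>`, `<s>`) is also counted.

```python
from collections import Counter

def bigram_mle(tokens):
    """Map each observed bigram (prev, word) to C(prev word) / C(prev)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Count a token as a context only where it has a successor,
    # so the probabilities for each context sum to 1.
    context_counts = Counter(tokens[:-1])
    return {
        (prev, word): count / context_counts[prev]
        for (prev, word), count in bigram_counts.items()
    }

# Toy corpus (illustrative only).
tokens = "<s> i am sam </s> <s> sam i am </s>".split()
probs = bigram_mle(tokens)
p_am_given_i = probs[("i", "am")]   # "i" occurs twice, both times followed by "am"
p_i_given_s = probs[("<s>", "i")]   # "<s>" is followed by "i" once and "sam" once
```

Bigrams that never occur in the corpus are simply absent from the table, which corresponds to an MLE probability of 0.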

MLE & Normalizing

Perplexity is a probability-based metric for evaluating language models. It is the weighted average of the number of possible next words that can follow any word, a.k.a. the weighted average branching factor.  

Given a mini-language of 10 words ("zero", "one", ..., "nine") in which each word occurs with probability 1/10 (unigram), a sequence of N words has probability $(1/10)^N$, so the perplexity is the inverse of the per-word probability, which is 10:

$$
\begin{aligned}
\mathrm{PP}(W) &=P\left(w_{1} w_{2} \ldots w_{N}\right)^{-\frac{1}{N}} \\
&=\left(\left(\frac{1}{10}\right)^{N}\right)^{-\frac{1}{N}} \\
&=\left(\frac{1}{10}\right)^{-1} \\
&=10
\end{aligned}
$$
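
The same calculation can be checked numerically. The sketch below assumes the model's probability for each word of the test sequence is already available as a list; the function name `perplexity` and the sequence length of 25 are arbitrary choices for illustration.

```python
import math

def perplexity(word_probs):
    """Perplexity of a sequence, given the model probability of each of its words.

    Computes P(w_1 ... w_N) ** (-1 / N) in log space to avoid underflow.
    """
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

# Uniform unigram model over 10 words: every word has probability 1/10.
uniform_pp = perplexity([1 / 10] * 25)  # approximately 10 for any sequence length
```

Because every word has the same probability, the result does not depend on the sequence length; a model that assigns higher probability to the observed words would yield a lower perplexity.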
