Calculating the exact probability of a word sequence using the chain rule requires conditioning on the entire history of preceding words, which leads to severe data sparsity and computational issues. The Markov assumption simplifies this by assuming that the probability of a word depends only on a fixed window of preceding words. For an $$n$$-gram model, the history is limited to the previous $$n-1$$ words, approximating $$P(w_k|w_{1:k-1}) \approx P(w_k|w_{k-n+1:k-1})$$.
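As a small illustration (the example words are hypothetical, not taken from a corpus), setting $$n = 3$$ gives a trigram model, in which the history shrinks to the two preceding words:
$$P(w_k|w_{1:k-1}) \approx P(w_k|w_{k-2:k-1})$$
For instance, $$P(\text{sat}|\text{the big cat})$$ would be approximated by $$P(\text{sat}|\text{big cat})$$.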


Represent a sequence of $$n$$ words as either $$w_1, \dots, w_n$$ or $$w_{1:n}$$. The joint probability of observing this exact sequence is denoted as $$P(w_1, \dots, w_n)$$ or $$P(w_{1:n})$$. By applying the chain rule of probability, this joint probability can be decomposed into a product of conditional probabilities: $$P(w_{1:n}) = P(w_1)P(w_2|w_1)\dots P(w_n|w_{1:n-1}) = \prod_{k=1}^{n}P(w_k|w_{1:k-1})$$.
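To make the decomposition concrete, consider a hypothetical three-word sequence "the cat sat" (chosen only for illustration). Under the chain rule it would expand as:
$$P(\text{the cat sat}) = P(\text{the})\,P(\text{cat}|\text{the})\,P(\text{sat}|\text{the cat})$$
Each factor conditions on every word before it, so the last factor of a length-$$n$$ sequence conditions on $$n-1$$ words.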

Chain Rule of Probability for Word Sequences

A helpful, still-in-progress book resource on NLP
https://web.stanford.edu/~jurafsky/slp3/

Speech and Language Processing (3rd ed. draft) 

Markov Assumption in N-Gram Models

The bigram model approximates the probability of a word given all previous words, $P(w_n|w_{1:n-1})$, by the conditional probability of the word given only the preceding word, $P(w_n|w_{n-1})$.

The bigram probability of a word $w_n$ given a previous word $w_{n-1}$ is computed by dividing the count of the bigram $w_{n-1}w_n$ by the count of all bigrams that share the same first word $w_{n-1}$ (which is equivalent to the unigram count for the word $w_{n-1}$):
$$P(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n)}{C(w_{n-1})}$$
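The count ratio above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the toy corpus, the function name `bigram_probabilities`, and the `<s>` / `</s>` sentence-boundary markers are assumptions introduced here, not part of the text above.

```python
from collections import Counter

def bigram_probabilities(sentences):
    """Estimate P(word | prev) as C(prev word) / C(prev) from raw counts."""
    bigram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        # Assumption: pad each sentence with boundary markers so the first
        # and last words also take part in a bigram.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            bigram_counts[(prev, word)] += 1
            # Counting prev once per bigram it starts gives the number of
            # bigrams that share this first word.
            context_counts[prev] += 1
    return {
        (prev, word): count / context_counts[prev]
        for (prev, word), count in bigram_counts.items()
    }

# Toy corpus, assumed purely for illustration.
corpus = ["i am sam", "sam i am", "i do not like green eggs and ham"]
probs = bigram_probabilities(corpus)
print(probs[("i", "am")])  # C(i am) = 2 and C(i) = 3, so the estimate is 2/3
```

Bigrams that never occur in the corpus are simply absent from the returned dictionary, which corresponds to an estimated probability of zero under this unsmoothed count ratio.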

Bigram Model

The bigram model can be generalized to the N-gram model, which approximates the probability by looking $N-1$ words into the past:
$P(w_n|w_{1:n-1}) \approx P(w_n|w_{n-N+1:n-1})$.

In the general case, the N-gram probability of a word $w_n$ is given by
$$P(w_n|w_{n-N+1:n-1}) = \frac{C(w_{n-N+1:n-1}w_n)}{C(w_{n-N+1:n-1})}$$
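The same counting idea extends to any order $N$. The sketch below is one possible way to write it, under stated assumptions: the function name `ngram_probabilities`, the toy corpus, and the choice to pad each sentence with $N-1$ start markers `<s>` and one end marker `</s>` are all introduced here for illustration.

```python
from collections import Counter

def ngram_probabilities(sentences, n):
    """Estimate P(word | previous n-1 words) from raw N-gram counts."""
    if n < 2:
        raise ValueError("this sketch expects n >= 2")
    ngram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        # Assumption: n-1 start markers give the first real word a full context.
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])  # the n-1 preceding words
            ngram_counts[context + (tokens[i],)] += 1
            context_counts[context] += 1
    return {
        ngram: count / context_counts[ngram[:-1]]
        for ngram, count in ngram_counts.items()
    }

# Toy corpus, assumed purely for illustration.
corpus = ["i am sam", "sam i am", "i do not like green eggs and ham"]
trigram_probs = ngram_probabilities(corpus, 3)
print(trigram_probs[("i", "am", "sam")])  # C(i am sam) = 1, C(i am) = 2
```

Calling `ngram_probabilities(corpus, 2)` reduces to the bigram case, since the context is then a single preceding word.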
