$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max(C(w_{i-1}w_i) - d, 0)}{C(w_{i-1})} + \lambda(w_{i-1})P_{CONTINUATION}(w_i)$$

where $$\lambda$$ is a normalizing constant that redistributes the discounted probability mass:

$$\lambda(w_{i-1}) = \frac{d}{\sum_v C(w_{i-1}v)} \left|\{w : C(w_{i-1}w) > 0\}\right|$$
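The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. The function name `train_kneser_ney_bigram`, the default discount `d=0.75`, and the toy corpus are assumptions made for this example. The sketch also assumes the usual continuation probability, normalized by the number of distinct bigram types, and it uses $$\sum_v C(w_{i-1}v)$$ as the denominator of both terms so that each conditional distribution sums to one.

```python
from collections import Counter, defaultdict


def train_kneser_ney_bigram(tokens, d=0.75):
    """Return a function prob(word, prev) for interpolated Kneser-Ney bigrams.

    Hypothetical helper for illustration; d=0.75 is an assumed discount.
    """
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_totals = Counter()    # sum_v C(prev v)
    followers = defaultdict(set)  # {w : C(prev w) > 0}
    preceders = defaultdict(set)  # {v : C(v w) > 0}
    for (prev, word), count in bigram_counts.items():
        context_totals[prev] += count
        followers[prev].add(word)
        preceders[word].add(prev)
    n_bigram_types = len(bigram_counts)

    def prob(word, prev):
        total = context_totals[prev]
        if total == 0:
            raise ValueError(f"context {prev!r} was never seen with a following word")
        # First term: discounted bigram relative frequency.
        discounted = max(bigram_counts[(prev, word)] - d, 0) / total
        # lambda(prev): the mass removed by discounting, d per distinct follower.
        lam = d / total * len(followers[prev])
        # P_CONTINUATION(word): share of bigram types that `word` completes.
        p_continuation = len(preceders.get(word, ())) / n_bigram_types
        return discounted + lam * p_continuation

    return prob


tokens = ("san francisco san francisco san francisco "
          "my glasses your glasses").split()
p_kn = train_kneser_ney_bigram(tokens)
vocabulary = sorted(set(tokens))
for w in vocabulary:
    print(f"P_KN({w} | san) = {p_kn(w, 'san'):.4f}")
```

Even though "glasses" never follows "san" in the toy corpus, it receives a nonzero probability through the $$\lambda(w_{i-1})P_{CONTINUATION}(w_i)$$ term.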

University of California, Santa Cruz

University of Michigan - Ann Arbor

Google

Kneser-Ney discounting augments absolute discounting with a better estimate of the lower-order unigram distribution. Rather than estimating how likely a word $$w$$ is to appear at all, it uses a unigram model that estimates how likely $$w$$ is to appear as a novel continuation, that is, to complete a previously unseen bigram. This estimate is based on the number of distinct bigrams $$w$$ has already completed. The intuition is that a word seen in many different contexts is likely to show up in new ones, whereas a word with more situational usage is not.
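To make the intuition concrete, the sketch below computes a continuation probability on a toy corpus. It assumes the common definition, which the text above does not spell out: the number of distinct words that precede $$w$$, divided by the total number of distinct bigram types. The function name and the corpus are invented for illustration.

```python
from collections import Counter


def continuation_probabilities(tokens):
    """Map each word to (# distinct preceding words) / (# distinct bigram types)."""
    bigram_types = set(zip(tokens, tokens[1:]))
    # Every distinct bigram (v, w) adds one to the continuation count of w.
    continuation_counts = Counter(w for _, w in bigram_types)
    n_types = len(bigram_types)
    return {w: n / n_types for w, n in continuation_counts.items()}


corpus = ("san francisco san francisco san francisco "
          "my glasses your glasses").split()
raw_counts = Counter(corpus)
p_cont = continuation_probabilities(corpus)
for word in ("francisco", "glasses"):
    print(word, "raw count:", raw_counts[word],
          "continuation probability:", round(p_cont[word], 3))
```

In this corpus "francisco" occurs more often than "glasses", but it only ever follows "san", so its continuation probability is lower than that of "glasses", which follows two different words.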

Kneser-Ney Discounting

Interpolated Kneser-Ney Smoothing for Bigrams

The best-performing version of Kneser-Ney smoothing is called Modified Kneser-Ney Smoothing. Rather than using a single fixed discount $$d$$, modified Kneser-Ney uses three different discounts, $$d_1$$, $$d_2$$, and $$d_{3+}$$, for n-grams with counts of one, two, and three or more, respectively.
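The text does not say how the three discounts are chosen. One widely used option, sketched below, is the closed-form estimate from Chen and Goodman's empirical study of smoothing, which derives them from the counts-of-counts $$n_1, n_2, n_3, n_4$$ (the number of n-grams seen exactly one, two, three, and four times). The function names and the toy counts are assumptions for this example, and other ways of setting the discounts, such as tuning on held-out data, are equally valid.

```python
from collections import Counter


def modified_kn_discounts(ngram_counts):
    """Estimate (d1, d2, d3_plus) from a mapping of n-gram -> count.

    Uses the Chen-and-Goodman-style closed form (an assumption here):
    Y = n1 / (n1 + 2*n2), d_k = k - (k + 1) * Y * n_{k+1} / n_k.
    """
    count_of_counts = Counter(ngram_counts.values())
    n1, n2, n3, n4 = (count_of_counts[k] for k in (1, 2, 3, 4))
    if min(n1, n2, n3) == 0:
        raise ValueError("need n-grams with counts 1, 2 and 3 to estimate discounts")
    y = n1 / (n1 + 2 * n2)
    d1 = 1 - 2 * y * n2 / n1
    d2 = 2 - 3 * y * n3 / n2
    d3_plus = 3 - 4 * y * n4 / n3
    return d1, d2, d3_plus


def discount_for(count, discounts):
    """Pick the discount matching an n-gram count (0 for unseen n-grams)."""
    if count == 0:
        return 0.0
    return discounts[min(count, 3) - 1]


# Toy counts: 8 n-grams seen once, 4 seen twice, 2 seen three times, 1 seen four times.
toy_counts = {}
for count, how_many in ((1, 8), (2, 4), (3, 2), (4, 1)):
    for i in range(how_many):
        toy_counts[(f"w{count}", f"v{i}")] = count

discounts = modified_kn_discounts(toy_counts)
print("d1, d2, d3+ =", discounts)
print("discount for a count of 5:", discount_for(5, discounts))
```

In the bigram formula, the selected discount would replace the fixed $$d$$ in the numerator, and the normalizing constant would be adjusted to sum the three discounts over the corresponding follower sets.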
