Three reference scenarios help interpret a perplexity value. In the best case, the model always assigns probability $$1$$ to the target token, so the perplexity is $$1$$. In the worst case, the model assigns probability $$0$$ to the target token, so the perplexity is positive infinity. As a baseline, if the model predicts a uniform distribution over all tokens in the vocabulary, the perplexity equals the vocabulary size (the number of unique tokens). This baseline is a nontrivial upper bound that any useful model must beat.
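The three scenarios can be checked numerically. The sketch below is a minimal illustration, not a reference implementation: the helper `perplexity` and the vocabulary size of 1000 are assumptions made for the example.

```python
import math

def perplexity(target_probs):
    """Perplexity from the probabilities a model assigned to each target token."""
    # A single zero probability makes log(p) diverge, so the perplexity is infinite.
    if any(p == 0.0 for p in target_probs):
        return math.inf
    avg_nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(avg_nll)

vocab_size = 1000  # hypothetical vocabulary size, chosen only for illustration

best = perplexity([1.0] * 5)                   # every target token predicted with probability 1
worst = perplexity([0.9, 0.0, 0.8])            # one target token given probability 0
uniform = perplexity([1.0 / vocab_size] * 5)   # uniform guess over the vocabulary

print(best, worst, uniform)
```

Under these assumptions, `best` is 1, `worst` is infinite, and `uniform` matches the vocabulary size up to floating-point rounding.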

Claude

Perplexity is a metric used to evaluate the quality of a language model. It is defined as the exponential of the average cross-entropy loss over a sequence of $$n$$ tokens: $$\exp\left(-\frac{1}{n} \sum_{t=1}^n \log P(x_t \mid x_{t-1}, \ldots, x_1)\right)$$. Conceptually, perplexity is the reciprocal of the geometric mean of the probabilities the model assigns to the actual tokens, which can be read as the effective number of real choices available when deciding the next token. A lower perplexity indicates a better model, one that assigns higher probability to the tokens that actually occur.
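As a small worked example of this definition, the sketch below computes perplexity from per-step predictive distributions. The function name `sequence_perplexity`, the three-token vocabulary, and the toy probabilities are all hypothetical values chosen for illustration.

```python
import math

def sequence_perplexity(step_distributions, targets):
    """exp of the average negative log-probability of the target tokens."""
    log_likelihood = 0.0
    for dist, target in zip(step_distributions, targets):
        # dist[target] plays the role of P(x_t | x_{t-1}, ..., x_1)
        log_likelihood += math.log(dist[target])
    cross_entropy = -log_likelihood / len(targets)
    return math.exp(cross_entropy)

# Hypothetical predictions over a 3-token vocabulary {0, 1, 2}, one row per step.
dists = [
    [0.50, 0.25, 0.25],
    [0.25, 0.50, 0.25],
    [0.25, 0.25, 0.50],
    [0.50, 0.25, 0.25],
]
targets = [0, 1, 2, 1]  # the tokens that actually occurred

ppl = sequence_perplexity(dists, targets)

# The same value, written as the reciprocal of the geometric mean
# of the probabilities assigned to the target tokens.
target_probs = [d[t] for d, t in zip(dists, targets)]
geometric_mean = math.prod(target_probs) ** (1.0 / len(target_probs))
print(ppl, 1.0 / geometric_mean)
```

Here the target probabilities are 0.5, 0.5, 0.5 and 0.25, so the two printed values agree at $$2^{5/4}$$, roughly 2.38 effective choices per step.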

Perplexity

Dive into Deep Learning

The branching factor of a language is the number of possible next words that can follow any given word. Perplexity can be thought of as the weighted average branching factor of a language, where each possible next word is weighted by its probability.
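To make the weighting concrete, the sketch below compares a uniform and a skewed next-word distribution over ten words. It assumes the test data is drawn from the same distribution the model predicts, in which case the perplexity equals the exponential of the entropy. The function name and the probabilities are hypothetical.

```python
import math

def weighted_branching_factor(next_word_probs):
    """exp(entropy) of a next-word distribution: the effective number of choices."""
    entropy = -sum(p * math.log(p) for p in next_word_probs if p > 0)
    return math.exp(entropy)

# Ten possible next words in both cases, so the plain branching factor is 10.
uniform_words = [0.1] * 10            # every word equally likely
skewed_words = [0.91] + [0.01] * 9    # one word dominates

uniform_factor = weighted_branching_factor(uniform_words)
skewed_factor = weighted_branching_factor(skewed_words)
print(uniform_factor, skewed_factor)
```

With equally likely words the weighted factor matches the plain branching factor of 10. When one word dominates, the weighted factor falls well below 10 even though ten words remain possible.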

Branching Factor and Perplexity

Perplexity Evaluation Scenarios

For a test set $$W = w_1, \dots, w_N$$, its perplexity is $$PP(W) = \left(\prod_{i=1}^{N}\frac{1}{P(w_i \mid w_1, \dots, w_{i-1})}\right)^{\frac{1}{N}}$$.
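This product form gives the same value as the exponential of the average negative log-probability. The sketch below checks that numerically on made-up conditional probabilities; both function names are hypothetical. In practice the log form is usually preferred, because the product of many small probabilities underflows for long sequences.

```python
import math

def perplexity_product_form(probs):
    """PP(W) as the N-th root of the product of inverse probabilities."""
    n = len(probs)
    inverse_product = 1.0
    for p in probs:
        inverse_product *= 1.0 / p
    return inverse_product ** (1.0 / n)

def perplexity_log_form(probs):
    """The same quantity, computed as exp of the average negative log-probability."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

# Hypothetical values of P(w_i | w_1, ..., w_{i-1}) for a 4-word test set.
probs = [0.2, 0.5, 0.1, 0.4]

pp_product = perplexity_product_form(probs)
pp_log = perplexity_log_form(probs)
print(pp_product, pp_log)
```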
