For a test set $W = w_1 w_2 \ldots w_N$ of $N$ tokens, its perplexity is
$$PP(W) = \left(\prod_{i=1}^{N}\frac{1}{P(w_i|w_1...w_{i-1})}\right)^{\frac{1}{N}}$$

Computing Perplexity
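As a minimal sketch of the formula above, the function below takes the probability a model assigned to each observed token and returns the perplexity. The function name `perplexity` and the example probabilities are illustrative assumptions, not part of the source. The sum is taken in log space because multiplying many small probabilities directly tends to underflow.

```python
import math

def perplexity(token_probs):
    """Perplexity from the per-token probabilities P(w_i | w_1 ... w_{i-1}).

    `token_probs` is a hypothetical input: one probability per test token.
    """
    n = len(token_probs)
    if n == 0:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 for p in token_probs):
        # A zero-probability token makes the product of inverses infinite.
        return math.inf
    # (prod 1/p_i)^(1/N) == exp(-(1/N) * sum(log p_i))
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / n)

# Made-up probabilities for a four-token test set.
probs = [0.5, 0.25, 0.5, 0.25]
pp = perplexity(probs)
```

For these four probabilities the product of inverses is 2 · 4 · 2 · 4 = 64, so the result is the fourth root of 64.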

The Branching Factor of a language is the number of possible next words that can follow any word. Perplexity can be thought of as the Weighted Average Branching Factor of a language.

Branching Factor and Perplexity
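The difference between the plain branching factor and the weighted one can be illustrated with a toy language of ten digits. This is a sketch under assumed numbers: the skewed distribution (0.91 for the digit 0, 0.01 for each other digit) and the test string are chosen for illustration only.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

digits = [str(d) for d in range(10)]
branching_factor = len(digits)  # any of the 10 digits may follow any digit

# Illustrative test string: mostly zeros, with a single "3".
test = "0 0 0 0 0 3 0 0 0 0".split()

# Model A: every digit equally likely.
uniform = {d: 1.0 / len(digits) for d in digits}
# Model B (assumed numbers): "0" is very frequent, the other digits are rare.
skewed = {d: 0.01 for d in digits}
skewed["0"] = 0.91

pp_uniform = perplexity([uniform[t] for t in test])
pp_skewed = perplexity([skewed[t] for t in test])
```

The branching factor is 10 in both cases, and the uniform model's perplexity matches it. The skewed model's perplexity on this test string is far lower, because the likely next digit is almost always the same one.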

The perplexity of a language model can be evaluated across different scenarios based on its prediction accuracy. In the best-case scenario, the model perfectly estimates the target token's probability as $$1$$, resulting in a perplexity of $$1$$. In the worst-case scenario, the model predicts the target token's probability as $$0$$, leading to a perplexity of positive infinity. As a baseline, if the model predicts a uniform distribution over all available tokens, the perplexity equals the number of unique tokens in the vocabulary. This baseline provides a nontrivial upper bound that any useful model must beat.

Perplexity Evaluation Scenarios
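The three scenarios can be checked numerically. The sketch below assumes a hypothetical vocabulary of 8 tokens and a 5-token sequence; both numbers are arbitrary.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities assigned to the observed tokens."""
    if any(p == 0.0 for p in token_probs):
        return math.inf  # log(0) is undefined; the perplexity diverges
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

vocab_size = 8  # hypothetical vocabulary size
n_tokens = 5    # hypothetical sequence length

# Best case: every observed token is predicted with probability 1.
best = perplexity([1.0] * n_tokens)
# Worst case: at least one observed token is given probability 0.
worst = perplexity([1.0, 0.0, 1.0, 1.0, 1.0])
# Baseline: a uniform distribution over the vocabulary.
baseline = perplexity([1.0 / vocab_size] * n_tokens)
```

Here `best` is 1, `worst` is infinite, and `baseline` equals `vocab_size` up to floating-point rounding, regardless of the sequence length.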

Perplexity is a metric used to evaluate the quality of a language model. It is mathematically defined as the exponential of the average cross-entropy loss over a sequence of $$n$$ tokens: $$\exp\left(-\frac{1}{n} \sum_{t=1}^n \log P(x_t \mid x_{t-1}, \ldots, x_1)\right)$$. Conceptually, perplexity is the reciprocal of the geometric mean of the probabilities the model assigns to the observed tokens, which can be read as the effective number of choices the model is weighing when deciding the next token. A lower perplexity indicates a better model that predicts the next token with higher accuracy.
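The exponential form agrees with the product form of perplexity given for a test set $W$. Writing $$w_i$$ and $$N$$ in place of $$x_t$$ and $$n$$, and using $$a = \exp(\log a)$$, a short derivation is:

$$PP(W) = \left(\prod_{i=1}^{N}\frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}\right)^{\frac{1}{N}} = \exp\left(\frac{1}{N}\sum_{i=1}^{N}\log\frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}\right) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i \mid w_1 \ldots w_{i-1})\right)$$

The middle step turns the $$N$$-th root of a product into the average of logarithms, which is why perplexity is a geometric mean of the inverse token probabilities.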

University of Michigan - Ann Arbor

Claude

University of Texas at San Antonio

Language models are evaluated by how well they predict unseen text: a better model assigns a higher probability to held-out data.

Evaluating language models

To measure the quality of a language model and make performance comparable across documents of different lengths, we evaluate it using the cross-entropy loss averaged over all $$n$$ tokens in a sequence. The formula is given by: $$\frac{1}{n} \sum_{t=1}^n -\log P(x_t \mid x_{t-1}, \ldots, x_1)$$, where $$P$$ represents the conditional probability provided by the language model and $$x_t$$ is the actual token observed at time step $$t$$. A better model yields a lower average loss, which conceptually corresponds to spending fewer bits to compress the sequence.

Average Cross-Entropy Loss for Sequence Modeling
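A minimal sketch of the averaged loss follows, assuming the model's probabilities for the observed tokens are already available as a list. The function name and the toy probabilities are illustrative. The example also shows why averaging makes documents of different lengths comparable, and how to convert the loss from nats to bits.

```python
import math

def average_cross_entropy(token_probs):
    """Mean of -log P(x_t | x_{t-1}, ..., x_1) over the sequence, in nats."""
    return sum(-math.log(p) for p in token_probs) / len(token_probs)

# Two hypothetical documents with the same per-token prediction quality
# but very different lengths.
short_doc = [0.5, 0.25]
long_doc = [0.5, 0.25] * 50

loss_short = average_cross_entropy(short_doc)
loss_long = average_cross_entropy(long_doc)

# Dividing by log(2) converts nats to bits per token.
bits_per_token = loss_short / math.log(2)
```

The summed loss of `long_doc` is 50 times that of `short_doc`, but the averaged losses coincide. With these probabilities the model spends 1 bit on one token and 2 bits on the other, so `bits_per_token` is 1.5.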

An ongoing (still in draft) but helpful book resource on NLP:
https://web.stanford.edu/~jurafsky/slp3/

Speech and Language Processing (3rd ed. draft) 

Dive into Deep Learning

An Extrinsic Evaluation is an end-to-end evaluation of a language model: the model is embedded in an application, and we measure how much the application's performance improves. This is the only way to know whether a particular improvement in a component will really help the task at hand.


Extrinsic Evaluation

An Intrinsic Evaluation metric is one that measures the quality of a model independently of any application; it requires only a test corpus. To compare two models, we divide a corpus of text into training and test sets and train the parameters of both models on the training set. Whichever model assigns the higher probability to the test set, meaning it predicts the test set more accurately, is the better model.

Intrinsic Evaluation
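The comparison procedure can be sketched on a toy corpus. Everything below is an illustrative assumption: the tiny training and test texts, the choice of a unigram and a bigram model as the two candidates, and the use of add-one smoothing so that unseen events do not receive zero probability.

```python
import math
from collections import Counter

# Hypothetical toy corpus, split into training and test tokens.
train = "the cat sat on the mat the dog sat on the rug".split()
test = "the cat sat on the rug".split()

V = len(set(train) | set(test))          # vocabulary size
unigram_counts = Counter(train)
context_counts = Counter(train[:-1])     # tokens that have a successor
bigram_counts = Counter(zip(train, train[1:]))

def unigram_logprob(tokens):
    """Log-probability under an add-one smoothed unigram model."""
    total = len(train)
    return sum(math.log((unigram_counts[w] + 1) / (total + V)) for w in tokens)

def bigram_logprob(tokens):
    """Log-probability under an add-one smoothed bigram model.

    The first token has no context, so it is scored with the unigram model.
    """
    lp = unigram_logprob(tokens[:1])
    for prev, w in zip(tokens, tokens[1:]):
        lp += math.log((bigram_counts[(prev, w)] + 1) / (context_counts[prev] + V))
    return lp

scores = {"unigram": unigram_logprob(test), "bigram": bigram_logprob(test)}
better = max(scores, key=scores.get)     # higher test log-probability wins

# Per-token perplexity gives the same ranking, inverted.
perplexities = {name: math.exp(-lp / len(test)) for name, lp in scores.items()}
```

On this toy data the bigram model assigns the higher probability to the test set and therefore has the lower perplexity. With a corpus this small the result says nothing about the two model families in general.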

A good language model predicts upcoming tokens with high accuracy to produce sensible and logically coherent text. For example, given the context "It is raining", a high-quality model might suggest "outside", which is both grammatically and semantically appropriate. A lower-quality model might propose "banana tree", generating a nonsensical extension that still demonstrates basic spelling and word correlation. A poorly trained model that fails to fit the data would output random characters, such as "piouw;kcj pwepoiut".
