Perplexity is a metric for evaluating the quality of a language model. It is defined as the exponential of the average cross-entropy loss over a sequence of $$n$$ tokens: $$\exp\left(-\frac{1}{n} \sum_{t=1}^n \log P(x_t \mid x_{t-1}, \ldots, x_1)\right)$$. Equivalently, perplexity is the reciprocal of the geometric mean of the probabilities the model assigns to the observed tokens, which can be read as the effective number of equally likely choices the model faces when deciding the next token. A lower perplexity indicates a better model, one that assigns higher probability to the tokens that actually occur.
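
As a rough illustration of the definition, the sketch below computes perplexity from a list of per-token probabilities. The function name `perplexity` and the input `token_probs` are assumptions made for this example; in practice the probabilities would come from a trained model's predictions for the observed tokens.

```python
import math


def perplexity(token_probs):
    """Exponential of the average negative log-probability of the observed tokens.

    token_probs[t] is assumed to hold P(x_t | x_{t-1}, ..., x_1), the probability
    the model assigned to the token that actually occurred at step t.
    """
    n = len(token_probs)
    avg_cross_entropy = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_cross_entropy)


# A model that always spreads its probability evenly over 4 candidates
# behaves as if it had 4 real choices at every step.
uniform_over_four = [0.25] * 8

# A model that is certain of every observed token has only one choice.
perfect = [1.0] * 8

print(perplexity(uniform_over_four))
print(perplexity(perfect))
```

The first call should come out at approximately 4 and the second at 1, matching the reading of perplexity as an effective number of choices.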

Perplexity

To measure the quality of a language model and make performance comparable across documents of different lengths, we evaluate it using the cross-entropy loss averaged over all $$n$$ tokens in a sequence. The formula is given by: $$\frac{1}{n} \sum_{t=1}^n -\log P(x_t \mid x_{t-1}, \ldots, x_1)$$, where $$P$$ represents the conditional probability provided by the language model and $$x_t$$ is the actual token observed at time step $$t$$. A better model yields a lower average loss, which conceptually corresponds to spending fewer bits to compress the sequence.
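
A minimal sketch of why the average is used: the total loss grows with document length, while the per-token average does not. The helper names and the toy probabilities are assumptions for illustration only. The conversion to bits assumes the loss was computed with the natural logarithm.

```python
import math


def average_cross_entropy(token_probs):
    """Mean of -log P(x_t | x_{t-1}, ..., x_1) over the sequence, in nats."""
    return sum(-math.log(p) for p in token_probs) / len(token_probs)


def nats_to_bits(nats):
    """Convert a loss measured with the natural log into bits per token."""
    return nats / math.log(2)


# Two hypothetical documents predicted equally well per token,
# but of very different lengths.
short_doc = [0.5] * 4
long_doc = [0.5] * 400

total_short = sum(-math.log(p) for p in short_doc)
total_long = sum(-math.log(p) for p in long_doc)

print(total_short, total_long)  # totals differ by a factor of 100
print(average_cross_entropy(short_doc), average_cross_entropy(long_doc))
print(nats_to_bits(average_cross_entropy(long_doc)))  # about 1 bit per token
```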

Claude

Language models are evaluated by how well they predict unseen text. A better model assigns higher probability to held-out data.

Evaluating language models

Dive into Deep Learning

An Extrinsic Evaluation is an end-to-end evaluation of a language model: the model is embedded in an application, and we measure how much the application's performance improves. This is the only way to know whether a particular improvement in a component will really help the task at hand.


Extrinsic Evaluation

An Intrinsic Evaluation metric measures the quality of a model independently of any application; it requires only a test corpus. To compare two models, we divide a corpus of text into training and test sets and train the parameters of both models on the training set. Whichever model assigns the higher probability to the test set, meaning it predicts the test set more accurately, is the better model.
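
The comparison described above can be sketched with two deliberately simple models. Everything here is an assumption chosen for illustration: the tiny corpus, the add-k smoothed unigram model, the uniform baseline, and the function names. Real evaluations use far larger corpora and stronger models.

```python
import math
from collections import Counter


def train_unigram(train_tokens, vocab, k=1.0):
    """Add-k smoothed unigram model: P(w) = (count(w) + k) / (N + k * |V|)."""
    counts = Counter(train_tokens)
    total = len(train_tokens) + k * len(vocab)
    return {w: (counts[w] + k) / total for w in vocab}


def uniform_model(vocab):
    """Baseline that ignores the training data and treats every word as equally likely."""
    return {w: 1.0 / len(vocab) for w in vocab}


def corpus_log_prob(model, tokens):
    """Log-probability the model assigns to a token sequence."""
    return sum(math.log(model[w]) for w in tokens)


# Toy corpus, split into training and test sets.
train_tokens = "the cat sat on the mat the dog sat on the rug".split()
test_tokens = "the dog sat on the mat".split()
vocab = sorted(set(train_tokens) | set(test_tokens))

unigram_lp = corpus_log_prob(train_unigram(train_tokens, vocab), test_tokens)
uniform_lp = corpus_log_prob(uniform_model(vocab), test_tokens)

# The model with the higher test-set log-probability is preferred.
better = "unigram" if unigram_lp > uniform_lp else "uniform"
print(unigram_lp, uniform_lp, better)
```

Log-probabilities are compared instead of raw probabilities because the product of many small probabilities underflows quickly on realistic test sets.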

Intrinsic Evaluation

A good language model predicts upcoming tokens with high accuracy to produce sensible and logically coherent text. For example, given the context "It is raining", a high-quality model might suggest "outside", which is both grammatically and semantically appropriate. A lower-quality model might propose "banana tree", generating a nonsensical extension that still demonstrates basic spelling and word correlation. A poorly trained model that fails to fit the data would output random characters, such as "piouw;kcj pwepoiut".
