
Language Model

Sequence models for natural language data, which is composed of discrete tokens such as words, are called language models. They make it possible to evaluate the likelihood of a sentence, sample new text sequences, and search for the most likely outputs. By applying the chain rule of probability, language modeling reduces to an autoregressive prediction problem: the joint probability of a sequence $P(x_1, \ldots, x_T)$ is decomposed into a product of conditional probabilities in a left-to-right fashion, $P(x_1, \ldots, x_T) = P(x_1) \prod_{t=2}^{T} P(x_t \mid x_{t-1}, \ldots, x_1)$. For discrete signals, the autoregressive model acts as a probabilistic classifier: given the context to the left, it outputs a full probability distribution over the vocabulary for the next word.
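The factorization above can be sketched in a few lines of Python. This is a minimal illustration rather than a real language model: the four-word vocabulary, the `TOY_TABLE` probabilities, and the function names are all hypothetical, and the toy classifier looks only at the most recent token, a simplification that the chain rule itself does not require.

```python
import math
import random

# Hypothetical toy vocabulary; "<eos>" marks the end of a sequence.
VOCAB = ["the", "cat", "sat", "<eos>"]

# Made-up probabilities P(next token | previous token), one row per previous
# token. None stands for the empty context at the start of a sequence.
TOY_TABLE = {
    None:    [0.70, 0.20, 0.05, 0.05],
    "the":   [0.05, 0.80, 0.10, 0.05],
    "cat":   [0.10, 0.05, 0.70, 0.15],
    "sat":   [0.20, 0.05, 0.05, 0.70],
    "<eos>": [0.25, 0.25, 0.25, 0.25],
}


def next_token_probs(context):
    """Probabilistic classifier: map the left context to a distribution over VOCAB.

    Assumption for brevity: only the last token of the context is used.
    A real model would condition on the entire context.
    """
    prev = context[-1] if context else None
    return dict(zip(VOCAB, TOY_TABLE[prev]))


def sequence_log_prob(tokens):
    """Chain rule: log P(x_1..x_T) = sum over t of log P(x_t | x_1..x_{t-1})."""
    total = 0.0
    for t, token in enumerate(tokens):
        total += math.log(next_token_probs(tokens[:t])[token])
    return total


def sample_sequence(max_len=10, seed=0):
    """Sample left to right, one token at a time, until "<eos>" or max_len."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        token = rng.choices(list(probs), weights=list(probs.values()))[0]
        tokens.append(token)
        if token == "<eos>":
            break
    return tokens
```

Under these made-up numbers, `math.exp(sequence_log_prob(["the", "cat", "sat"]))` is the product 0.7 × 0.8 × 0.7 = 0.392, up to floating-point rounding. Summing log-probabilities instead of multiplying raw probabilities is a common choice because it avoids numerical underflow on long sequences.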

Updated 2026-05-13

Tags

Data Science

D2L

Dive into Deep Learning @ D2L