Maximum Likelihood Estimation for Sequential Data
In the context of sequential data, maximum likelihood estimation aims to find the optimal language model parameters by maximizing the total sequence-level log-likelihood across a given dataset $D$. The objective of maximum likelihood training is formally defined as:

$$\hat{\theta} = \arg\max_{\theta} \sum_{\mathbf{x} \in D} \log \Pr_{\theta}(\mathbf{x}),$$

where $\log \Pr_{\theta}(\mathbf{x}) = \sum_{i} \log \Pr_{\theta}(x_i \mid x_0, \dots, x_{i-1})$ represents the sum of the conditional log-probabilities for an individual complete sequence.
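As a minimal sketch of this objective in Python, assuming a hypothetical `cond_log_prob(params, prefix, token)` helper that returns $\log \Pr_{\theta}(x_i \mid x_{<i})$ (the helper name, toy model, and data layout below are illustrative, not from the course):

```python
import math

def sequence_log_prob(params, seq, cond_log_prob):
    """Sum of conditional log-probabilities for one complete sequence:
    log Pr(x) = sum_i log Pr(x_i | x_<i)."""
    return sum(cond_log_prob(params, seq[:i], seq[i]) for i in range(len(seq)))

def mle_objective(params, dataset, cond_log_prob):
    """Total sequence-level log-likelihood over the dataset.
    MLE selects the params that maximize this quantity."""
    return sum(sequence_log_prob(params, seq, cond_log_prob) for seq in dataset)

# Toy "model" that assigns the same fixed probability to every token.
toy_params = {"p": 0.5}
toy_cond_log_prob = lambda params, prefix, token: math.log(params["p"])

dataset = [["a", "b"], ["a", "b", "c"]]
print(mle_objective(toy_params, dataset, toy_cond_log_prob))  # 5 * log(0.5)
```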
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Related
Maximum Likelihood Estimation for Sequential Data
Fine-Tuning as Maximum Likelihood Estimation
Log-Probability Decomposition for Efficient Multi-Turn Dialogue Training
A language model is being trained on a dataset containing a mix of very short sequences and a few extremely long sequences. A developer observes that the overall training objective, which is the sum of the log-probabilities of all sequences in the dataset, seems to be disproportionately influenced by the model's performance on the few long sequences. Which of the following best explains this observation?
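One way to see why this happens, as a rough back-of-the-envelope sketch (assuming for illustration that every token is predicted with about the same probability $p$): a sequence of length $m$ contributes roughly

$$\log \Pr(\mathbf{x}) = \sum_{i=1}^{m} \log \Pr(x_i \mid x_{<i}) \approx m \log p,$$

so a 10,000-token sequence carries about 100 times the weight of a 100-token sequence in the summed objective.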
Model Parameter Selection via Likelihood
A language model is being trained on a large dataset of text sequences. After a single parameter update, the model's calculated log-probability for one specific sequence in the dataset increases by 2.5, while the log-probabilities for all other sequences in the dataset remain exactly the same. How does this change affect the overall maximum likelihood training objective for the entire dataset?
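Because the overall objective is a plain sum of per-sequence log-probabilities, the change in the total is just the sum of the per-sequence changes. With the numbers above:

$$\Delta \mathcal{L} = \sum_{\mathbf{x} \in D} \Delta \log \Pr_{\theta}(\mathbf{x}) = 2.5 + 0 + \cdots + 0 = 2.5,$$

so the total objective increases by exactly 2.5. ($\mathcal{L}$ is used here only as shorthand for the summed objective, not as notation from the course.)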
Standard Optimization Objective for Transformer Language Models
Maximum Likelihood Estimation for Sequential Data
In training a model on a dataset $D$ of sequences $\mathbf{x}$, a primary goal is to find parameters that maximize the total log-probability of the observed sequences. This objective can be expressed in two equivalent ways:

Form 1: $\hat{\theta} = \arg\max_{\theta} \sum_{\mathbf{x} \in D} \log \Pr_{\theta}(\mathbf{x})$

Form 2: $\hat{\theta} = \arg\max_{\theta} \sum_{\mathbf{x} \in D} \sum_{i} \log \Pr_{\theta}(x_i \mid x_0, \dots, x_{i-1})$

What fundamental principle of probability justifies the mathematical equivalence between Form 1 and Form 2?
Verifying Log-Probability Equivalence
Analysis of a Language Model Training Objective
The mathematical equivalence between maximizing the log-probability of an entire sequence, $\log \Pr(\mathbf{x})$, and maximizing the sum of its conditional log-probabilities, $\sum_i \log \Pr(x_i \mid x_{<i})$, is established because the chain rule of probability factors $\Pr(\mathbf{x})$ into a product of conditional probabilities, and the logarithm transforms that product into a sum of log-probabilities.
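As a quick numeric check of this equivalence (a minimal sketch with made-up conditional probabilities, not course code), the product-then-log route and the sum-of-logs route give the same value:

```python
import math

# Made-up conditional probabilities for a 3-token sequence:
# Pr(x_0), Pr(x_1 | x_0), Pr(x_2 | x_0, x_1)
cond_probs = [0.2, 0.5, 0.8]

# Route 1: chain rule first (product of conditionals), then log.
log_prob_whole = math.log(math.prod(cond_probs))

# Route 2: log of each conditional, then sum.
log_prob_sum = sum(math.log(p) for p in cond_probs)

assert math.isclose(log_prob_whole, log_prob_sum)
print(log_prob_whole, log_prob_sum)  # both ≈ -2.5257
```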
Learn After
Log-Likelihood Objective for Language Model Training
Formulating the MLE Objective for a Small Dataset
Total Loss Calculation for a Token Sequence
A model is being trained on a dataset containing just two sequences:
seq_1 = (x_0, x_1) and seq_2 = (y_0, y_1, y_2). According to the principle of maximum likelihood estimation for sequential data, which expression correctly represents the decomposed log-probability that the model aims to maximize for this entire dataset?

When training a model on a sequence of data using the Maximum Likelihood Estimation objective, a single prediction with a very low conditional probability for one element in the sequence can have a disproportionately large negative impact on the total log-probability calculated for that entire sequence.
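For the two-sequence dataset above, the decomposed objective works out as follows (a worked expansion under the standard chain-rule factorization; only the grouping of terms is added here):

$$\log \Pr(\text{seq}_1) + \log \Pr(\text{seq}_2) = \log \Pr(x_0) + \log \Pr(x_1 \mid x_0) + \log \Pr(y_0) + \log \Pr(y_1 \mid y_0) + \log \Pr(y_2 \mid y_0, y_1)$$

The second statement follows from this same sum: a single term $\log \Pr(x_i \mid x_{<i})$ tends to $-\infty$ as the conditional probability approaches zero, so one badly predicted token can dominate the entire sequence's log-probability.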
Pre-trained Language Model Decoder Inference