Maximum Likelihood Training Objective for a Dataset of Sequences
The training objective under the Maximum Likelihood Estimation (MLE) framework is to find the model parameters $\theta$ that maximize the total log-probability of all sequences in a dataset $D$. This is achieved by summing the log-probabilities of each individual sequence $\text{seq}$, as calculated by the model parameterized by $\theta$. The general objective is formally expressed as:

$$
\hat{\theta} = \arg\max_{\theta} \sum_{\text{seq} \in D} \log \Pr_{\theta}(\text{seq})
$$

For datasets composed of input-output pairs $(x, y)$, this objective can be specified as maximizing the joint log-probability of the concatenated sequences $[x, y]$:

$$
\hat{\theta} = \arg\max_{\theta} \sum_{(x, y) \in D} \log \Pr_{\theta}([x, y])
$$

This approach is equivalent to maximizing the sum of the log-likelihoods of all data points in the training set.
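As a minimal sketch of this objective, the snippet below sums per-sequence log-probabilities over a dataset for a toy bigram-style model. The table of conditional log-probabilities is borrowed from the worked example further down this note; the function names are illustrative, not from any particular library.

```python
# Toy bigram-style model: conditional log-probabilities log Pr(token | prev).
# Values are borrowed from the worked example later in this note.
LOG_PROBS = {
    ("<s>", "A"): -0.5,
    ("<s>", "B"): -1.5,
    ("A", "B"): -0.2,
    ("B", "A"): -1.0,
    ("A", "<eos>"): -2.0,
    ("B", "<eos>"): -0.1,
}

def sequence_log_prob(seq, log_probs):
    """Log-probability of one sequence: sum of its token-level conditionals."""
    total, prev = 0.0, "<s>"
    for token in seq:
        total += log_probs[(prev, token)]
        prev = token
    return total

def dataset_log_likelihood(dataset, log_probs):
    """The MLE training objective: sum of log-probabilities over all sequences in D."""
    return sum(sequence_log_prob(seq, log_probs) for seq in dataset)

dataset = [["A", "B", "<eos>"], ["B", "A", "<eos>"]]
print(dataset_log_likelihood(dataset, LOG_PROBS))  # ≈ -5.3 (= -0.8 + -4.5)
```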
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Ch.4 Alignment - Foundations of Large Language Models
Related
Relationship between KL Divergence and MLE
Cross-entropy loss
Mean Squared Error
The property of consistency of maximum likelihood
Statistical Efficiency Principle of MLE
Maximum Likelihood Estimator Properties
Log-Likelihood Gradient
Maximum Likelihood Training Objective for a Dataset of Sequences
Kullback-Leibler Divergence
Model Selection via Likelihood
Training Objective as Loss Minimization over a Dataset
Mathematical Equivalence of General and Sequential MLE Objectives
A researcher is modeling a series of coin flips. They observe the following sequence of outcomes: Heads, Tails, Heads, Heads. The researcher wants to find the best parameter for their model, where the parameter represents the probability of the coin landing on Heads. According to the principle of maximum likelihood estimation, which of the following parameter values best explains the observed data?
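For reference, the estimate can be derived in closed form: with three Heads and one Tails, the likelihood of the data as a function of the Heads-probability $\theta$ is maximized at the empirical frequency:

$$
L(\theta) = \theta^3 (1 - \theta), \qquad \frac{dL}{d\theta} = \theta^2 (3 - 4\theta) = 0 \;\Rightarrow\; \hat{\theta} = \frac{3}{4}
$$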
Parameter Estimation via Conditional Log-Likelihood Maximization
Equivalence of Maximizing Likelihood and Minimizing Loss
Equivalence of Squared Loss and Maximum Likelihood Estimation
Negative Log-Likelihood Objective for Softmax Regression
Maximum Likelihood Training Objective for a Dataset of Sequences
A language model is defined by the following table of conditional log-probabilities, where <s> is the start-of-sequence token and <eos> is the end-of-sequence token:

| Log-Probability | Value |
|---|---|
| log Pr(A \| <s>) | -0.5 |
| log Pr(B \| <s>) | -1.5 |
| log Pr(B \| A) | -0.2 |
| log Pr(A \| B) | -1.0 |
| log Pr(<eos> \| A) | -2.0 |
| log Pr(<eos> \| B) | -0.1 |

Given a training dataset D containing two sequences:

- Sequence 1: (A, B, <eos>)
- Sequence 2: (B, A, <eos>)

Calculate the log-likelihood for each individual sequence in the dataset. Which of the following options correctly lists the results?
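For reference, a worked computation: each sequence's log-likelihood is the sum of its token-level conditional log-probabilities from the table above.

$$
\begin{aligned}
\log \Pr_{\theta}(\text{Seq 1}) &= \log \Pr(A \mid \text{<s>}) + \log \Pr(B \mid A) + \log \Pr(\text{<eos>} \mid B) \\
&= -0.5 + (-0.2) + (-0.1) = -0.8 \\
\log \Pr_{\theta}(\text{Seq 2}) &= \log \Pr(B \mid \text{<s>}) + \log \Pr(A \mid B) + \log \Pr(\text{<eos>} \mid A) \\
&= -1.5 + (-1.0) + (-2.0) = -4.5
\end{aligned}
$$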
Verifying Language Model Performance on a Small Dataset
You are tasked with evaluating a language model's performance on a dataset composed of multiple text sequences. Arrange the following steps in the correct logical order to compute the log-likelihood for each individual sequence in the dataset.
Learn After
Maximum Likelihood Estimation for Sequential Data
Fine-Tuning as Maximum Likelihood Estimation
Log-Probability Decomposition for Efficient Multi-Turn Dialogue Training
A language model is being trained on a dataset containing a mix of very short sequences and a few extremely long sequences. A developer observes that the overall training objective, which is the sum of the log-probabilities of all sequences in the dataset, seems to be disproportionately influenced by the model's performance on the few long sequences. Which of the following best explains this observation?
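A clarifying identity: each sequence's log-probability decomposes into one conditional term per token, so longer sequences contribute more (and typically more strongly negative) terms to the summed objective:

$$
\log \Pr_{\theta}(\text{seq}) = \sum_{i=1}^{|\text{seq}|} \log \Pr_{\theta}(x_i \mid x_1, \ldots, x_{i-1})
$$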
Model Parameter Selection via Likelihood
A language model is being trained on a large dataset of text sequences. After a single parameter update, the model's calculated log-probability for one specific sequence in the dataset increases by 2.5, while the log-probabilities for all other sequences in the dataset remain exactly the same. How does this change affect the overall maximum likelihood training objective for the entire dataset?
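Since the objective is additive over sequences, the total change equals the sum of the per-sequence changes; here the overall objective increases by exactly 2.5:

$$
\Delta \left( \sum_{\text{seq} \in D} \log \Pr_{\theta}(\text{seq}) \right) = +2.5
$$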
Standard Optimization Objective for Transformer Language Models