Learn Before
Parameter Estimation via Conditional Log-Likelihood Maximization
In the context of training a Large Language Model (LLM), the optimal parameters, denoted as $\hat{\theta}$, are found by maximizing the conditional log-likelihood across a dataset $\mathcal{D}$. This supervised learning objective involves finding the parameters that maximize the sum of the logarithmic probabilities of the true outputs $y$ given the inputs $x$, where the probability is predicted by the LLM. The formula is expressed as:

$$\hat{\theta} = \arg\max_{\theta} \sum_{(x, y) \in \mathcal{D}} \log \Pr_{\theta}(y \mid x)$$

In some contexts, the input can be represented by other variables, such as a context $c$ and a latent variable $z$, leading to an equivalent formulation:

$$\hat{\theta} = \arg\max_{\theta} \sum_{(c, z, y) \in \mathcal{D}} \log \Pr_{\theta}(y \mid c, z)$$
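As a minimal sketch of this objective, the snippet below computes the summed log-probability of the true outputs for a toy discrete model. The lookup-table "model" and its vocabulary are assumptions for illustration, standing in for an LLM's predicted distribution:

```python
import math

# Hypothetical conditional model: a table mapping each input x to a
# probability distribution over outputs y (a stand-in for Pr_theta(y | x)).
def pr(y, x, theta):
    return theta[x][y]

def conditional_log_likelihood(dataset, theta):
    # The objective: sum of log Pr_theta(y | x) over all (x, y) pairs in D.
    return sum(math.log(pr(y, x, theta)) for x, y in dataset)

dataset = [("a", "yes"), ("b", "no"), ("a", "yes")]
theta_1 = {"a": {"yes": 0.9, "no": 0.1}, "b": {"yes": 0.3, "no": 0.7}}
theta_2 = {"a": {"yes": 0.5, "no": 0.5}, "b": {"yes": 0.5, "no": 0.5}}

# theta_1 assigns higher probability to the observed outputs, so it
# achieves the larger (less negative) log-likelihood — it is the better
# candidate under this objective.
assert conditional_log_likelihood(dataset, theta_1) > conditional_log_likelihood(dataset, theta_2)
```

In practice the maximization over $\theta$ is done by gradient ascent on this sum (equivalently, gradient descent on its negation, the cross-entropy loss), rather than by comparing a handful of candidate parameter settings.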

Tags
Data Science
Foundations of Large Language Models Course
Computing Sciences
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Related
Relationship between KL Divergence and MLE
Cross-entropy loss
Mean Squared Error
The property of consistency of maximum likelihood
Statistical Efficiency Principle of MLE
Maximum Likelihood Estimator Properties
Log-Likelihood Gradient
Maximum Likelihood Training Objective for a Dataset of Sequences
Kullback-Leibler Divergence
Model Selection via Likelihood
Training Objective as Loss Minimization over a Dataset
Mathematical Equivalence of General and Sequential MLE Objectives
A researcher is modeling a series of coin flips. They observe the following sequence of outcomes: Heads, Tails, Heads, Heads. The researcher wants to find the best parameter for their model, where the parameter represents the probability of the coin landing on Heads. According to the principle of maximum likelihood estimation, which of the following parameter values best explains the observed data?
Parameter Estimation via Conditional Log-Likelihood Maximization
Equivalence of Maximizing Likelihood and Minimizing Loss
Equivalence of Squared Loss and Maximum Likelihood Estimation
Negative Log-Likelihood Objective for Softmax Regression
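The coin-flip question above can be checked numerically. For the sequence Heads, Tails, Heads, Heads, the likelihood of a parameter $\theta = \Pr(\text{Heads})$ is $L(\theta) = \theta^3 (1 - \theta)$; the candidate values below are assumptions chosen for illustration:

```python
# Likelihood of observing H, T, H, H as a function of theta = Pr(Heads):
# three Heads contribute theta**3, one Tails contributes (1 - theta).
def likelihood(theta):
    return theta**3 * (1 - theta)

candidates = [0.25, 0.5, 0.75, 1.0]
best = max(candidates, key=likelihood)

# The analytic MLE for a Bernoulli parameter is the empirical frequency,
# 3 heads / 4 flips = 0.75, which indeed maximizes L among the candidates.
# Note theta = 1.0 gives likelihood zero: it cannot explain the observed Tails.
assert best == 0.75
```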
Learn After
Language Model as a Stochastic Policy
Plackett-Luce Loss Function
A model is being trained by maximizing the sum of log-probabilities for a dataset of 1,000 examples. Consider two scenarios for a single training update:
Scenario A: The probability assigned to the correct output for one example improves from 0.1 to 0.2. The probabilities for all other 999 examples remain unchanged.
Scenario B: The probability assigned to the correct output for one example improves from 0.8 to 0.9. The probabilities for all other 999 examples remain unchanged.
Which scenario leads to a larger increase in the overall training objective function, and why?
Model Comparison using Conditional Log-Likelihood
Evaluating a Training Update
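The two-scenario question above can be settled with a quick computation: because the objective sums log-probabilities, what matters is the change in the logarithm, i.e. the ratio of the new probability to the old one, not the absolute increase of 0.1:

```python
import math

# Change in the summed log-probability objective from each single-example
# update (all other 999 terms are unchanged, so they cancel).
delta_A = math.log(0.2) - math.log(0.1)  # = log(2), a doubling of probability
delta_B = math.log(0.9) - math.log(0.8)  # = log(1.125), a much smaller ratio

# Scenario A improves the objective more, even though both raise the
# probability by the same additive amount.
assert delta_A > delta_B
```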