Theory

Equivalence of Squared Loss and Maximum Likelihood Estimation

Minimizing the mean squared error is mathematically equivalent to performing maximum likelihood estimation for a linear model under the assumption of additive Gaussian noise. In the negative log-likelihood objective for linear regression, if the standard deviation $\sigma$ is assumed to be fixed, the term $\frac{1}{2} \log(2 \pi \sigma^2)$ becomes a constant that can be ignored during optimization. The remaining term is identical to the squared error loss, except for the multiplicative constant $\frac{1}{\sigma^2}$, which does not alter the location of the minimum.
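This equivalence can be checked numerically. The sketch below (a hypothetical example with synthetic data; the names `mse` and `neg_log_likelihood` are my own) evaluates both objectives for a simple linear model and verifies that, with $\sigma$ fixed, the negative log-likelihood equals the constant $\frac{1}{2}\log(2\pi\sigma^2)$ plus the mean squared residual scaled by $\frac{1}{2\sigma^2}$, so the two share the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear data with additive Gaussian noise (illustrative values).
X = rng.normal(size=100)
true_w, true_b, sigma = 2.0, -1.0, 0.5
y = true_w * X + true_b + rng.normal(scale=sigma, size=100)

def mse(w, b):
    """Mean squared residual for parameters (w, b)."""
    resid = y - (w * X + b)
    return np.mean(resid ** 2)

def neg_log_likelihood(w, b, sigma=sigma):
    """Average negative log-likelihood of N(y | w*x + b, sigma^2)."""
    resid = y - (w * X + b)
    return np.mean(0.5 * np.log(2 * np.pi * sigma ** 2)
                   + resid ** 2 / (2 * sigma ** 2))

# With sigma fixed: NLL = const + MSE / (2 sigma^2),
# so minimizing either objective yields the same (w, b).
const = 0.5 * np.log(2 * np.pi * sigma ** 2)
w, b = 1.7, -0.8  # arbitrary trial parameters
assert np.isclose(neg_log_likelihood(w, b),
                  const + mse(w, b) / (2 * sigma ** 2))
```

Because the two objectives differ only by an additive constant and a positive scale factor, any optimizer (gradient descent included) follows proportional gradients and converges to the same parameters.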

Updated 2026-05-02

Tags

Data Science

D2L

Dive into Deep Learning @ D2L