Learn Before
Cross-entropy loss
The cross-entropy loss function works well for models that predict a binary class (i.e., an output probability between 0 and 1). It is defined as -[y*log(y-hat) + (1-y)*log(1-(y-hat))]. When y=0, the left term vanishes and the loss reduces to -log(1-(y-hat)); when y=1, the right term vanishes and the loss reduces to -log(y-hat). In both cases the loss penalizes predictions that stray from the true label, encouraging predicted probabilities close to it.
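The definition above can be sketched in a few lines of Python. This is a minimal illustration (the function name and the epsilon clipping are my own additions, the latter to avoid log(0)):

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy: -[y*log(y_hat) + (1-y)*log(1-y_hat)]."""
    # Clip the prediction away from exactly 0 or 1 so log() stays finite.
    y_hat = min(max(y_hat, eps), 1 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# When y = 1, the loss reduces to -log(y_hat): a confident correct
# prediction (0.9) is penalized less than a hesitant one (0.6).
print(binary_cross_entropy(1, 0.9))  # ≈ 0.105
print(binary_cross_entropy(1, 0.6))  # ≈ 0.511

# When y = 0, the loss reduces to -log(1 - y_hat).
print(binary_cross_entropy(0, 0.1))  # ≈ 0.105
```

Note how the loss for the y=0 case with y-hat=0.1 equals the loss for the y=1 case with y-hat=0.9: both predictions are "90% correct," and the two branches of the formula treat them symmetrically.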
Tags
Data Science
Foundations of Large Language Models Course
Computing Sciences
Related
Relationship between KL Divergence and MLE
Cross-entropy loss
Mean Squared Error
The property of consistency of maximum likelihood
Statistical Efficiency Principle of MLE
Maximum Likelihood Estimator Properties
Log-Likelihood Gradient
Maximum Likelihood Training Objective for a Dataset of Sequences
Kullback-Leibler Divergence
Model Selection via Likelihood
Training Objective as Loss Minimization over a Dataset
Mathematical Equivalence of General and Sequential MLE Objectives
A researcher is modeling a series of coin flips. They observe the following sequence of outcomes: Heads, Tails, Heads, Heads. The researcher wants to find the best parameter for their model, where the parameter represents the probability of the coin landing on Heads. According to the principle of maximum likelihood estimation, which of the following parameter values best explains the observed data?
Parameter Estimation via Conditional Log-Likelihood Maximization
Equivalence of Maximizing Likelihood and Minimizing Loss
Equivalence of Squared Loss and Maximum Likelihood Estimation
Negative Log-Likelihood Objective for Softmax Regression
Cross-entropy loss
Logistic Regression Cost Function
A machine learning model is being trained for a prediction task. A key metric, the objective function, is tracked over time. The value of this function represents the magnitude of the model's error. A graph of this process shows the objective function's value consistently decreasing as the number of training iterations increases. What is the most accurate interpretation of this trend?
Diagnosing Model Training Issues
Calculating and Interpreting a Model's Objective Function
Surrogate Objective
Loss Function
Differentiable Objectives
Learn After
A Broad Definition of Cross Entropy
Why do we want to minimize cross-entropy loss?
Denoising Autoencoder Training Objective
MLM Training Objective using Cross-Entropy Loss
Consider a binary classification task where the correct label for a specific instance is 1. A model makes two different predictions for this instance: Prediction A is 0.9 and Prediction B is 0.6. According to the cross-entropy loss function, which statement accurately compares the loss for these two predictions?
Calculating Cross-Entropy Loss
Analyzing Model Errors with Cross-Entropy Loss
Loss Function for Language Modeling