Formula

Cross-Entropy Loss for Softmax Regression

For a pair of a one-hot label vector $\mathbf{y}$ and a model's predicted probability distribution $\hat{\mathbf{y}}$ over $q$ classes, the cross-entropy loss function is defined as:

$$l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{j=1}^q y_j \log \hat{y}_j$$
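As a quick illustration (a minimal sketch, not taken from the book), the loss can be computed directly from this definition. The helper `cross_entropy` and the example probabilities below are assumptions chosen for illustration:

```python
import numpy as np

# Cross-entropy loss for a single example (illustrative sketch).
# y is a one-hot label vector, y_hat a predicted probability distribution over q classes.
def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))

y = np.array([0.0, 1.0, 0.0])       # true class is index 1
y_hat = np.array([0.1, 0.7, 0.2])   # predicted probabilities (sum to 1)

print(cross_entropy(y, y_hat))      # equals -log(0.7) ≈ 0.357
```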

Because $\mathbf{y}$ is a one-hot vector, the sum vanishes for all terms except the one corresponding to the true class. The loss is bounded below by $0$ (probabilities cannot exceed $1$, so their negative logarithm cannot be lower than $0$), and it equals $0$ only if the model predicts the true label with complete certainty. However, reaching a probability of exactly $1$ requires infinite logits, so the loss is never exactly $0$ for finite weights. Conversely, assigning a probability of $0$ to the true label would incur an infinite loss ($-\log 0 = \infty$).
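To make these bounds concrete, the short sketch below (hypothetical values, not from the book) evaluates $-\log \hat{y}_{\text{true}}$ for a few predicted probabilities of the true class:

```python
import numpy as np

# Since y is one-hot, the loss reduces to -log of the probability
# assigned to the true class; these probabilities are illustrative.
for p_true in [0.99, 0.9, 0.5, 0.1, 1e-8]:
    print(f"p(true class) = {p_true:<8}  loss = {-np.log(p_true):.4f}")

# As p(true class) -> 1 the loss approaches 0 (but never reaches it for finite logits),
# and as p(true class) -> 0 the loss grows without bound.
```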


Updated 2026-05-03

Tags

Data Science

D2L

Dive into Deep Learning @ D2L