Formula

Softmax Function

To convert raw, unnormalized outputs $\mathbf{o}$ into valid probabilities, the softmax function applies an exponential function to each component and then normalizes them by their sum. The exponentiation ensures that all probabilities are non-negative, while the division ensures that they sum to 1. Mathematically, the predicted probability distribution $\hat{\mathbf{y}}$ is defined as:

$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \quad \text{where} \quad \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}$$

This guarantees that $0 \le \hat{y}_i \le 1$ and $\sum_j \hat{y}_j = 1$. Unlike other normalizations or the probit model, the softmax function preserves the order of its inputs and leads to a more well-behaved optimization problem.
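The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the D2L library's own implementation; it adds the standard max-subtraction trick, which leaves the result unchanged because softmax is invariant to adding a constant to every input, but prevents overflow in the exponentials.

```python
import numpy as np

def softmax(o):
    """Map raw outputs (logits) o to a probability distribution.

    Subtracting the max before exponentiating is a common
    numerical-stability trick: softmax(o) == softmax(o - c)
    for any constant c, so the result is unchanged.
    """
    o = np.asarray(o, dtype=float)
    exps = np.exp(o - o.max())  # non-negative entries, no overflow
    return exps / exps.sum()    # normalize so entries sum to 1

y_hat = softmax([2.0, 1.0, 0.1])
print(y_hat)        # each entry in [0, 1]
print(y_hat.sum())  # sums to 1
```

Note that the output preserves the ordering of the inputs: the largest logit receives the largest probability.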


Updated 2026-05-03
