Softmax Function
To convert raw, unnormalized outputs into valid probabilities, the softmax function applies the exponential function to each component and then normalizes the results by their sum. The exponentiation ensures that all probabilities are non-negative, while the division ensures that they sum to 1. Mathematically, given a vector of scores $\mathbf{z} = (z_1, \dots, z_K)$, the predicted probability distribution is defined as:

$$p_i = \mathrm{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K.$$

This guarantees that $0 \le p_i \le 1$ and $\sum_{i=1}^{K} p_i = 1$. Unlike other normalizations or the probit model, the softmax function preserves the order of the input scores and leads to a well-behaved optimization problem.
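As a concrete illustration, here is a minimal Python/NumPy sketch of this computation (the function name stable_softmax and the example scores are illustrative, not from the original text). It subtracts the maximum score before exponentiating; because softmax is invariant to adding a constant to every input, this leaves the output unchanged while preventing numerical overflow on large scores:

```python
import numpy as np

def stable_softmax(z):
    """Map raw scores (logits) z to a probability distribution.

    Subtracting max(z) before exponentiating does not change the
    result (softmax is invariant to adding a constant to every score),
    but it keeps np.exp from overflowing on large logits.
    """
    z = np.asarray(z, dtype=np.float64)
    exp_z = np.exp(z - np.max(z))  # shift, then exponentiate: all values > 0
    return exp_z / exp_z.sum()     # normalize so the outputs sum to 1

# Example: next-word scores such as 'mat' (6.0), 'rug' (3.0),
# 'floor' (0.5), 'chair' (0.5)
probs = stable_softmax([6.0, 3.0, 0.5, 0.5])
print(probs.round(3))  # [0.945 0.047 0.004 0.004]
print(probs.sum())     # 1.0 (up to floating-point rounding)
```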
Related
Linear vs. Non-Linear Activation Functions
Sigmoid/Logistic Function
TanH/Hyperbolic Tangent Function
Swish Function
ReLU (Rectified Linear Unit)
ELU (Exponential Linear Unit)
Which activation function is represented by each of these plots?
Which of the following introduces nonlinearity into neural networks?
Conditional Probability Formula for Autoregressive Models using Softmax
A language model is predicting the next word in a sequence. After processing the context, it has assigned the following unnormalized scores to a set of four candidate words: 'mat' (score=6.0), 'rug' (score=3.0), 'floor' (score=0.5), and 'chair' (score=0.5). To convert these scores into a valid probability distribution over this set, what is the final probability assigned to the word 'mat'?
A language model is evaluating three candidate tokens (A, B, C) to follow a given context. Initially, their scores are: Token A = 4, Token B = 4, Token C = 2. If the score for Token C is increased to 12, while the scores for Token A and Token B remain unchanged, how does this affect the normalized probabilities of Token A and Token B?
Comparing Model Confidence via Probability Normalization
Probit Model
Learn After
Pros and Cons of Softmax Function
Softmax Regression (Activation)
Parameterized Softmax Layer
Plackett-Luce Selection Probability Formula
Conditional Probability Formula for Autoregressive Models using Softmax
A neural network's final layer produces the raw output scores (logits) [2.0, 1.0, 0.1] for three possible classes. To convert these scores into class probabilities, a function is applied that first exponentiates each score and then normalizes these new values by dividing each by their sum. What is the resulting probability distribution? (Values are rounded to three decimal places).
A function is used to convert a vector of raw, unnormalized scores z = [z_1, z_2, ..., z_K] into a probability distribution. This function operates by first applying the standard exponential function to each score and then normalizing these new values by dividing each by their sum. If a constant value C is added to every score in the input vector z, resulting in a new vector z' = [z_1+C, z_2+C, ..., z_K+C], how will the resulting output probability distribution be affected?
Consider two input vectors of raw scores (logits) for a 3-class classification problem: Vector A = [1, 2, 3] and Vector B = [1, 5, 10]. Both vectors are passed through a function that exponentiates each score and then normalizes the results by dividing by their sum. How will the resulting probability distribution for Vector B compare to the one for Vector A?
You’re reviewing an internal evaluation script tha...
Your team is building an internal tool that ranks ...
You’re reviewing an internal LLM evaluation pipeli...
Reconciling Training Log-Likelihood with Inference-Time Sequence Selection
Explaining a Counterintuitive Decoding Outcome Using Softmax, Next-Token Conditionals, and Sequence Log-Probability
Diagnosing a “High-Confidence Wrong Token” Bug in Autoregressive Scoring
Investigating a Production Scoring Bug: Softmax Normalization vs. Autoregressive Sequence Log-Probability
Design a Correct Sequence-Scoring Function for Autoregressive LLM Outputs
Root-Cause Analysis: Why a “More Likely” Token-by-Token Completion Loses on Total Sequence Score
Auditing a Candidate Completion Using Softmax Next-Token Probabilities and Autoregressive Log-Probability
Derivative of Softmax Cross-Entropy Loss with Respect to Logits
Numerical Overflow in Softmax Function