
Hierarchical Softmax Formula

Hierarchical Softmax is an efficient alternative to the standard Softmax function for models with large output vocabularies. It works by partitioning the vocabulary into groups, or 'nodes'. The probability of a specific item j, which belongs to node u, is obtained by normalizing its exponentiated score against the exponentiated scores of all items across all nodes. The formula is expressed as:

$$\alpha_{i,j} = \frac{\exp(\beta_{i,j})}{\sum_{\mathbf{k}_{j'} \in \mathbf{K}^{[1]}} \exp(\beta_{i,j'}) + \cdots + \sum_{\mathbf{k}_{j'} \in \mathbf{K}^{[u]}} \exp(\beta_{i,j'}) + \cdots + \sum_{\mathbf{k}_{j'} \in \mathbf{K}^{[n_u]}} \exp(\beta_{i,j'})}$$

In this equation, the numerator is the exponentiated score for item j. The denominator is the normalization term, obtained by summing the exponentiated scores within each partition $\mathbf{K}^{[1]}, \ldots, \mathbf{K}^{[n_u]}$ and then adding those partial sums across all $n_u$ partitions of the vocabulary.
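The grouped normalization above can be sketched in NumPy. This is a minimal illustration, not the book's implementation: the function name `grouped_softmax` and the toy scores and partitions are assumptions for the example. It computes the denominator as per-partition sums of exponentiated scores, added across partitions, which matches the formula term by term.

```python
import numpy as np

def grouped_softmax(scores, groups):
    """Compute alpha_{i,j} for a fixed query i (hypothetical helper).

    scores: 1-D array of beta_{i,j} scores over the whole vocabulary.
    groups: list of index arrays partitioning the vocabulary into
            the groups K^[1], ..., K^[n_u] from the formula.
    """
    # Subtract the max before exponentiating for numerical stability;
    # the shift cancels in the ratio, so alpha is unchanged.
    exp_scores = np.exp(scores - scores.max())
    # Denominator: sum exponentiated scores within each partition,
    # then add the partial sums across all n_u partitions.
    denom = sum(exp_scores[g].sum() for g in groups)
    return exp_scores / denom

# Toy example: a vocabulary of 6 items split into 3 partitions of 2.
scores = np.array([1.0, 2.0, 0.5, 3.0, -1.0, 0.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
alpha = grouped_softmax(scores, groups)
```

Because the partitions cover the whole vocabulary, the resulting `alpha` sums to 1 and coincides with a standard softmax over all scores; the grouped form makes explicit how the normalization decomposes over nodes.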


Updated 2025-09-29


Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Computing Sciences