Concept

Hierarchical Softmax

  • Decomposing probabilities hierarchically by building nested categories of words in a tree-like structure over a large vocabulary $\mathbb{V}$
  • Instead of the amount of computation being proportional to $|\mathbb{V}|$ (and to the number of hidden units $n_h$), only the nodes along one root-to-leaf path need to be evaluated
  • In a balanced tree, the depth is $O(\log|\mathbb{V}|)$, so the per-word cost drops from $O(|\mathbb{V}|\,n_h)$ to $O(n_h \log|\mathbb{V}|)$
  • The probability of choosing a word is equal to the product of the probabilities of choosing the branch leading to that word at every node on a path from the root of the tree to the leaf containing the word
  • To predict the conditional probabilities needed at each node, a logistic regression model is typically used, given the same context $C$ as input. Since the correct output is encoded in the training set, these models can be trained with supervised learning, usually with a cross-entropy loss, which corresponds to maximizing the log-likelihood of the correct sequence of decisions.
  • Because the output log-likelihood can be computed efficiently, its gradients can also be computed efficiently, both with respect to the output parameters and with respect to the hidden-layer activations
  • It is possible in principle to optimize the tree structure to minimize the expected number of computations; however, this is usually not practical, since computing the output probabilities is only one part of the total computation in the model
  • Instead of optimizing a tree with a branching factor of 2, it is simpler to define a tree with a depth of 2 and a branching factor of $\sqrt{|\mathbb{V}|}$, i.e. a set of mutually exclusive word classes; this captures most of the benefit of the hierarchical strategy
  • The question remains of how best to define the word classes, or the hierarchy in general; discrete optimization can be used to approximately optimize the partition of words into classes
  • Computing the probability of all $|\mathbb{V}|$ words remains expensive, and the tree structure does not provide an exact way to select the most likely word in a given context
  • Hierarchical softmax produces computational benefits at both training time and test time, but tends to give worse results than sampling-based methods, possibly due to a poor choice of word classes.
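The path-product rule above can be sketched in NumPy with a toy balanced binary tree over four words; the tree, vocabulary, and all parameters below are illustrative stand-ins, not a reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy setup: 4 words at the leaves of a balanced binary tree of depth 2.
# Each internal node holds a logistic-regression weight vector applied to
# the context representation h (hypothetical, randomly initialized here).
rng = np.random.default_rng(0)
n_h = 8                                    # hidden-layer size
node_weights = rng.normal(size=(3, n_h))   # 3 internal nodes: root, left, right

# Path to each leaf as (internal node index, branch bit); bit 1 = go right.
paths = {
    "the": [(0, 0), (1, 0)],
    "cat": [(0, 0), (1, 1)],
    "sat": [(0, 1), (2, 0)],
    "mat": [(0, 1), (2, 1)],
}

def word_probability(word, h):
    """P(word | context) = product of branch probabilities along the path."""
    p = 1.0
    for node, bit in paths[word]:
        p_right = sigmoid(node_weights[node] @ h)
        p *= p_right if bit == 1 else (1.0 - p_right)
    return p

h = rng.normal(size=n_h)  # stand-in for the context representation C
probs = {w: word_probability(w, h) for w in paths}
print(probs)
print(sum(probs.values()))  # branch probabilities at each node sum to 1,
                            # so the leaf probabilities sum to 1
```

Note that scoring one word touches only the 2 nodes on its path, rather than all 4 leaves; with a balanced tree this generalizes to $O(\log|\mathbb{V}|)$ node evaluations per word.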
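The depth-2 alternative can likewise be sketched as a class-based factorization $P(\text{word} \mid C) = P(\text{class} \mid C)\,P(\text{word} \mid \text{class}, C)$; the vocabulary size, class assignment, and parameters below are assumptions for illustration:

```python
import numpy as np

# Minimal sketch: partition a toy vocabulary of |V| = 9 words into
# sqrt(|V|) = 3 mutually exclusive classes of 3 words each, then factor
# P(word | C) = P(class | C) * P(word | class, C).
rng = np.random.default_rng(1)
n_h, n_classes, words_per_class = 8, 3, 3

W_class = rng.normal(size=(n_classes, n_h))                  # class-level softmax
W_word = rng.normal(size=(n_classes, words_per_class, n_h))  # per-class softmax

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def word_probability(class_id, word_id, h):
    p_class = softmax(W_class @ h)[class_id]
    p_word_given_class = softmax(W_word[class_id] @ h)[word_id]
    return p_class * p_word_given_class

h = rng.normal(size=n_h)  # stand-in for the context representation C
total = sum(word_probability(c, w, h)
            for c in range(n_classes) for w in range(words_per_class))
print(total)  # the factorization sums to 1 over the full vocabulary
```

Scoring one word here needs two softmaxes of size roughly $\sqrt{|\mathbb{V}|}$ instead of one softmax of size $|\mathbb{V}|$, which is where most of the hierarchical benefit comes from.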


Updated 2025-10-07

Tags

Data Science

Foundations of Large Language Models Course

Computing Sciences