Learn Before
Hierarchical Softmax Formula
The Hierarchical Softmax is an efficient alternative to the standard Softmax function for models with large output vocabularies. It works by partitioning the vocabulary into classes or 'nodes'. The probability of a specific item j, which belongs to node u, factorizes into the probability of choosing node u and the probability of choosing j within that node, so j's score is normalized only against the scores of the other items in the same node rather than against the entire vocabulary. The formula is expressed as:

$$\Pr(j) \;=\; \Pr(u) \cdot \Pr(j \mid u) \;=\; \Pr(u) \cdot \frac{\exp(s_j)}{\sum_{k=1}^{n_u} \exp(s_k)}$$

In this equation, the numerator represents the exponentiated score for item j. The denominator is the normalization term, calculated by summing the exponentiated scores of the n_u items that belong to node u. Because each sum runs over a single node instead of the full vocabulary, the per-prediction cost drops from O(|V|) to roughly O(√|V|) for a balanced two-level partition, and to O(log |V|) when the nodes form a binary tree.
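As a concrete illustration, here is a minimal NumPy sketch of the two-level factorization above. All names and dimensions (h, W_nodes, W_items, the uniform node size) are hypothetical, chosen only to show that each normalization runs over a single node and that the factorized probabilities still sum to one over the whole vocabulary:

```python
import numpy as np

# Hypothetical setup: hidden state h, one weight matrix scoring the nodes,
# and one weight matrix per node scoring the items inside it.
rng = np.random.default_rng(0)
d = 8                   # hidden size (illustrative)
n_nodes = 4             # number of nodes the vocabulary is partitioned into
node_size = 5           # n_u: number of items in each node (uniform for simplicity)

h = rng.normal(size=d)
W_nodes = rng.normal(size=(n_nodes, d))             # node-level scores
W_items = rng.normal(size=(n_nodes, node_size, d))  # item scores within each node

def softmax(s):
    s = s - s.max()     # shift scores for numerical stability
    e = np.exp(s)
    return e / e.sum()

def hsm_prob(u, j):
    """Pr(item j in node u) = Pr(u) * Pr(j | u)."""
    p_node = softmax(W_nodes @ h)[u]     # normalized over the n_nodes nodes only
    p_item = softmax(W_items[u] @ h)[j]  # normalized over the n_u items of node u only
    return p_node * p_item

# The factorized probabilities still sum to 1 over the whole vocabulary.
total = sum(hsm_prob(u, j) for u in range(n_nodes) for j in range(node_size))
print(f"Pr(node 1, item 2) = {hsm_prob(1, 2):.4f}, total mass = {total:.4f}")
```

Each call evaluates one softmax over 4 scores and one over 5 scores, rather than a single softmax over all 20 items; this gap is what makes the factorization pay off at vocabulary scale.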

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Related
Hierarchical Softmax Formula
A machine learning team is training a language model with a vocabulary of over one million unique words. They decide to replace the standard output layer, which calculates a probability for every single word, with an architecture that organizes words into a binary tree. In this new setup, the probability of a target word is calculated by multiplying the probabilities of the choices made at each node along the path from the tree's root to the word's specific leaf. What is the most likely trade-off the team will face by making this change?
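For intuition, the path-product calculation this scenario describes can be sketched as follows. The tree depth, the per-node parameter vectors, and the word's left/right code are all hypothetical stand-ins here; a real implementation (e.g. word2vec-style hierarchical softmax) learns the node parameters from data:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical path for one target word: with ~1M words, a balanced binary
# tree is about 20 levels deep, so reaching a leaf takes ~20 binary choices.
rng = np.random.default_rng(1)
d = 8
depth = 20
h = rng.normal(size=d)                   # model's hidden state
node_vecs = rng.normal(size=(depth, d))  # one parameter vector per node on the path
code = rng.integers(0, 2, size=depth)    # the word's code: 1 = go left, 0 = go right

# Pr(word) = product over the path of the probability of the branch actually taken.
p = 1.0
for v, go_left in zip(node_vecs, code):
    p_left = sigmoid(v @ h)              # probability of branching left at this node
    p *= p_left if go_left else (1.0 - p_left)

print(f"Pr(target word) after {depth} binary decisions: {p:.3e}")
```

Scoring one target word thus costs about log2(|V|) ≈ 20 sigmoid evaluations instead of a million-way normalization, which is the efficiency side of the trade-off the question asks about.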
Computational Cost of Output Architectures
Probability Calculation in a Hierarchical Output Layer