Concept

Hierarchical Softmax

  • Decomposing probabilities hierarchically by building nested categories of words in a tree-like structure over a large vocabulary $\mathbb{V}$
  • Instead of the amount of computation being proportional to $|\mathbb{V}|$ (and to the number of hidden units $n_h$), only the nodes along one root-to-leaf path need to be evaluated
  • In a balanced tree, the depth is $O(\log|\mathbb{V}|)$, so the per-word cost drops from $O(|\mathbb{V}|\,n_h)$ to $O(n_h \log|\mathbb{V}|)$
  • The probability of choosing a word is equal to the product of the probabilities of choosing the branch leading to that word at every node on a path from the root of the tree to the leaf containing the word
  • To predict the conditional probabilities needed at each node, a logistic regression model is typically used, given the same context $C$ as input. Since the correct output is encoded in the training set, these models can be trained with supervised learning, usually with a cross-entropy loss, which corresponds to maximizing the log-likelihood of the correct sequence of decisions.
  • Because the output log-likelihood can be computed efficiently, its gradients can also be computed efficiently, both with respect to the output parameters and with respect to the hidden-layer activations
  • It is possible in principle to optimize the tree structure to minimize the expected number of computations; however, this is usually not practical, since computing the output probabilities is only one part of the total computation in the model
  • Instead of optimizing a tree with a branching factor of 2, it is simpler to define a tree with a depth of 2 and a branching factor of $\sqrt{|\mathbb{V}|}$, i.e. a set of mutually exclusive word classes; this captures most of the benefit of the hierarchical strategy
  • The question remains of how best to define the word classes, or the hierarchy in general; discrete optimization can be used to approximately optimize the partition of words into classes
  • Computing the probability of all $|\mathbb{V}|$ words remains expensive, and the tree structure does not provide an exact way to select the most likely word in a given context
  • Hierarchical softmax produces computational benefits at both training time and test time, but tends to give worse results than sampling-based methods, possibly due to a poor choice of word classes.
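The path-product rule above can be sketched in NumPy with a toy balanced binary tree over four words; the tree, vocabulary, and all parameters below are illustrative stand-ins, not a reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy setup: 4 words at the leaves of a balanced binary tree of depth 2.
# Each internal node holds a logistic-regression weight vector applied to
# the context representation h (hypothetical, randomly initialized here).
rng = np.random.default_rng(0)
n_h = 8                                    # hidden-layer size
node_weights = rng.normal(size=(3, n_h))   # 3 internal nodes: root, left, right

# Path to each leaf as (internal node index, branch bit); bit 1 = go right.
paths = {
    "the": [(0, 0), (1, 0)],
    "cat": [(0, 0), (1, 1)],
    "sat": [(0, 1), (2, 0)],
    "mat": [(0, 1), (2, 1)],
}

def word_probability(word, h):
    """P(word | context) = product of branch probabilities along the path."""
    p = 1.0
    for node, bit in paths[word]:
        p_right = sigmoid(node_weights[node] @ h)
        p *= p_right if bit == 1 else (1.0 - p_right)
    return p

h = rng.normal(size=n_h)  # stand-in for the context representation C
probs = {w: word_probability(w, h) for w in paths}
print(probs)
print(sum(probs.values()))  # branch probabilities at each node sum to 1,
                            # so the leaf probabilities sum to 1
```

Note that scoring one word touches only the 2 nodes on its path, rather than all 4 leaves; with a balanced tree this generalizes to $O(\log|\mathbb{V}|)$ node evaluations per word.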
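The depth-2 alternative can likewise be sketched as a class-based factorization $P(\text{word} \mid C) = P(\text{class} \mid C)\,P(\text{word} \mid \text{class}, C)$; the vocabulary size, class assignment, and parameters below are assumptions for illustration:

```python
import numpy as np

# Minimal sketch: partition a toy vocabulary of |V| = 9 words into
# sqrt(|V|) = 3 mutually exclusive classes of 3 words each, then factor
# P(word | C) = P(class | C) * P(word | class, C).
rng = np.random.default_rng(1)
n_h, n_classes, words_per_class = 8, 3, 3

W_class = rng.normal(size=(n_classes, n_h))                  # class-level softmax
W_word = rng.normal(size=(n_classes, words_per_class, n_h))  # per-class softmax

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def word_probability(class_id, word_id, h):
    p_class = softmax(W_class @ h)[class_id]
    p_word_given_class = softmax(W_word[class_id] @ h)[word_id]
    return p_class * p_word_given_class

h = rng.normal(size=n_h)  # stand-in for the context representation C
total = sum(word_probability(c, w, h)
            for c in range(n_classes) for w in range(words_per_class))
print(total)  # the factorization sums to 1 over the full vocabulary
```

Scoring one word here needs two softmaxes of size roughly $\sqrt{|\mathbb{V}|}$ instead of one softmax of size $|\mathbb{V}|$, which is where most of the hierarchical benefit comes from.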


Updated 2025-10-07

Tags

Data Science

Foundations of Large Language Models Course

Computing Sciences