Hierarchical Softmax
- Decomposes output probabilities hierarchically by organizing a large vocabulary into nested categories of words arranged in a tree structure
- Instead of the number of computations being proportional to the vocabulary size |V| (and also to the number of hidden units n_h), hierarchical softmax reduces the cost of computing a word's probability to O(log |V|)
- In a balanced binary tree, the depth is O(log |V|); for example, a vocabulary of about one million words requires a path of only about 20 binary decisions
- The probability of a word is the product of the probabilities of choosing the correct branch at every node on the path from the root of the tree to the leaf containing that word (see the first sketch after this list)
- The conditional probability needed at each node is often predicted by a logistic regression model that takes the same context as input. Since the target word, and hence the correct path through the tree, is given by the training set, these models can be trained with supervised learning, typically using a cross-entropy loss, which corresponds to maximizing the log-likelihood of the correct sequence of branching decisions.
- Because the output log-likelihood touches only the nodes on one root-to-leaf path, it can be computed efficiently, and so can its gradients, both with respect to the output parameters and with respect to the hidden-layer activations (see the gradient sketch after this list)
- In principle the tree structure could be optimized to minimize the expected number of computations, but doing so is usually impractical, since computing the output probabilities is only one part of the total computation in the model.
- Instead of a tree with a branching factor of 2, it is simpler to define a tree of depth 2 with a branching factor of sqrt(|V|), i.e. a set of mutually exclusive word classes; this captures most of the benefit of the hierarchical strategy (see the class-based sketch after this list)
- The question remains of how best to define the word classes, or the hierarchy in general. One option is to use discrete optimization to approximately optimize the partition of words into classes
- Computing the probabilities of all words remains expensive, and the tree structure does not provide an exact, efficient way to select the most likely word in a given context
- Hierarchical softmax produces computational benefits at both training time and test time, but it tends to give worse results than sampling-based methods, potentially due to a poor choice of word classes.
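A minimal sketch of the path-product probability, assuming a toy balanced binary tree over an 8-word vocabulary with one logistic regression per internal node; the names (word_log_prob, node_weights, the heap-style node indexing) are illustrative choices, not from any library:

```python
import numpy as np

rng = np.random.default_rng(0)

V = 8       # vocabulary size (a power of 2, so the tree is perfectly balanced)
H = 16      # dimension of the context / hidden representation
depth = int(np.log2(V))  # root-to-leaf path length: log2(8) = 3

# One logistic-regression weight vector per internal node; a balanced
# binary tree over V leaves has V - 1 internal nodes.
node_weights = rng.normal(scale=0.1, size=(V - 1, H))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_log_prob(word_idx, context):
    """log P(word | context): sum of log branch probabilities along the
    root-to-leaf path. Internal nodes are heap-indexed: the root is 0 and
    the children of node i are 2i+1 (left) and 2i+2 (right)."""
    log_p = 0.0
    node = 0
    # The bits of word_idx, read from most to least significant, encode
    # the branch (0 = left, 1 = right) taken at each level of the tree.
    for level in range(depth - 1, -1, -1):
        go_right = (word_idx >> level) & 1
        p_right = sigmoid(node_weights[node] @ context)
        log_p += np.log(p_right if go_right else 1.0 - p_right)
        node = 2 * node + 1 + go_right
    return log_p

context = rng.normal(size=H)
log_probs = np.array([word_log_prob(w, context) for w in range(V)])
print(np.exp(log_probs).sum())  # ~1.0: the leaf probabilities form a distribution
```

Because every leaf's probability is a product of complementary branch probabilities, the distribution over the whole vocabulary sums to 1 by construction, and scoring one word costs only log2(|V|) dot products.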
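Continuing the same toy model, a sketch of the gradients of the path log-likelihood: each node on the path contributes a standard logistic-regression loss, so the gradients take the familiar (label - sigmoid) form and touch only the O(log |V|) nodes on the path. The function name and arguments are again assumptions for illustration:

```python
import numpy as np

def word_log_prob_grads(word_idx, context, node_weights, depth):
    """Gradients of log P(word | context) from the sketch above.
    grad_w is sparse: only the nodes on the root-to-leaf path are nonzero."""
    grad_w = np.zeros_like(node_weights)
    grad_h = np.zeros_like(context)
    node = 0
    for level in range(depth - 1, -1, -1):
        go_right = (word_idx >> level) & 1
        p_right = 1.0 / (1.0 + np.exp(-(node_weights[node] @ context)))
        err = go_right - p_right  # d log P / d logit = label - sigmoid(logit)
        grad_w[node] += err * context          # gradient w.r.t. output parameters
        grad_h += err * node_weights[node]     # gradient w.r.t. hidden activations
        node = 2 * node + 1 + go_right
    return grad_w, grad_h
```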
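And a sketch of the depth-2 alternative: the vocabulary is partitioned into sqrt(|V|) mutually exclusive classes, and the word probability factors as P(class | context) * P(word | class, context). The uniform, index-based class assignment used here is an arbitrary illustration, not an optimized partition:

```python
import numpy as np

rng = np.random.default_rng(1)

V = 10_000
C = int(np.sqrt(V))   # 100 classes of V // C = 100 words each
H = 32

class_weights = rng.normal(scale=0.1, size=(C, H))          # softmax over classes
word_weights = rng.normal(scale=0.1, size=(C, V // C, H))   # softmax within a class

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def word_prob(word_idx, context):
    c, i = divmod(word_idx, V // C)  # class id and within-class index
    p_class = softmax(class_weights @ context)[c]     # O(sqrt(V) * H) work
    p_word = softmax(word_weights[c] @ context)[i]    # O(sqrt(V) * H) work
    return p_class * p_word  # ~2*sqrt(V) logits computed instead of V

context = rng.normal(size=H)
print(word_prob(1234, context))
```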
Practice question
A machine learning team is training a language model with a vocabulary of over one million unique words. They decide to replace the standard output layer, which calculates a probability for every single word, with an architecture that organizes words into a binary tree. In this new setup, the probability of a target word is calculated by multiplying the probabilities of the choices made at each node along the path from the tree's root to the word's specific leaf. What is the most likely trade-off the team will face by making this change?