Based on the information provided in the case study, what is the final probability assigned to the word 'predict'?


Hierarchical Softmax is an approach for reducing the computational cost of high-dimensional output layers over a large vocabulary $$\mathbb{V}$$. It decomposes probabilities hierarchically by organizing words into a tree of nested categories. A standard output layer requires computation proportional to the product of the vocabulary size $$|\mathbb{V}|$$ and the number of hidden units $$n_h$$; with a balanced tree, only the nodes on a single root-to-leaf path are evaluated, so the cost per word scales with the tree depth, $$O(\log|\mathbb{V}|)$$, instead of with $$|\mathbb{V}|$$. The probability of a specific word is the product of the conditional probabilities of choosing the correct branch at every node along the path from the root to the leaf containing that word. These conditional probabilities are often predicted by a logistic regression model that takes the context $$C$$ as input. While it is possible to optimize the structure of a binary tree, it is often simpler to define a tree of depth 2 with a branching factor of $$\sqrt{|\mathbb{V}|}$$, which groups words into mutually exclusive classes and captures most of the computational benefit.
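As a rough illustration of the path-product rule described above, the following minimal sketch computes word probabilities on a toy binary tree. The tree layout, the `node_weights` values, the `context` vector, and the function names are all hypothetical, chosen only for the example; each node is assumed to use a logistic regression over the context to decide between its left and right branch.

```python
import math


def sigmoid(x):
    """Logistic function used for each binary branch decision."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def word_probability(path, node_weights, context):
    """Multiply the branch probabilities along a root-to-leaf path.

    path: list of (node_id, go_right) pairs, ordered from the root down.
    node_weights: one weight vector per internal node (assumed values).
    context: the hidden/context vector fed to every node's classifier.
    """
    prob = 1.0
    for node_id, go_right in path:
        p_right = sigmoid(dot(node_weights[node_id], context))
        prob *= p_right if go_right else 1.0 - p_right
    return prob


# Toy tree over four words: node 0 is the root, nodes 1 and 2 are its children.
paths = {
    "the": [(0, False), (1, False)],
    "cat": [(0, False), (1, True)],
    "sat": [(0, True), (2, False)],
    "mat": [(0, True), (2, True)],
}
node_weights = {0: [0.5, -0.2], 1: [-1.0, 0.3], 2: [0.1, 0.8]}  # illustrative
context = [1.0, 2.0]  # illustrative

probs = {w: word_probability(p, node_weights, context) for w, p in paths.items()}
total = sum(probs.values())  # leaf probabilities of a full tree sum to 1
```

Each word here needs only two node evaluations (the tree depth) rather than one score per vocabulary entry, and because every node splits its probability mass between its two children, the leaf probabilities form a valid distribution without an explicit normalization over the vocabulary.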

Hierarchical Softmax

Hierarchical Softmax is an efficient alternative to the standard Softmax function for models with large output vocabularies. It works by partitioning the vocabulary into classes or 'nodes'. The probability of a specific item `j`, which belongs to node `u`, is calculated by normalizing its score against the scores of all items across all nodes. The formula is expressed as:

$$\alpha_{i,j} = \frac{\exp(\beta_{i,j})}{\sum_{\mathbf{k}_{j'} \in \mathbf{K}^{[1]}} \exp(\beta_{i,j'}) + \cdots + \sum_{\mathbf{k}_{j'} \in \mathbf{K}^{[u]}} \exp(\beta_{i,j'}) + \cdots + \sum_{\mathbf{k}_{j'} \in \mathbf{K}^{[n_u]}} \exp(\beta_{i,j'})}$$

In this equation, the numerator represents the exponentiated score for item `j`. The denominator is the normalization term, calculated by summing the exponentiated scores of all items over all `n_u` partitions of the vocabulary.
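The normalization in the formula above can be sketched in a few lines. This is only an illustration under assumed inputs: the nested list `scores` stands in for the scores $$\beta_{i,j'}$$ of one query `i`, grouped by partition, and the function name and the use of a local index within partition `u` are hypothetical conventions for the example.

```python
import math


def partitioned_softmax(scores_by_partition, u, j):
    """Probability of item j in partition u, normalized over all partitions.

    scores_by_partition: list of partitions, each a list of raw scores.
    u: index of the partition containing the item.
    j: index of the item inside partition u (local index, an assumption).
    """
    numerator = math.exp(scores_by_partition[u][j])
    # Denominator: one inner sum per partition, then added together.
    denominator = sum(
        sum(math.exp(b) for b in partition) for partition in scores_by_partition
    )
    return numerator / denominator


# Illustrative scores for three partitions of sizes 2, 3, and 1.
scores = [[0.2, -1.0], [1.5, 0.3, 0.0], [-0.7]]
alpha = partitioned_softmax(scores, u=1, j=0)

# The same value from a softmax over the flattened score list.
flat = [b for partition in scores for b in partition]
alpha_flat = math.exp(1.5) / sum(math.exp(b) for b in flat)
```

Written this way, the denominator is the ordinary softmax normalizer with its terms grouped by partition, which is why `alpha` and `alpha_flat` agree up to floating-point rounding.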

Hierarchical Softmax Formula

A machine learning team is training a language model with a vocabulary of over one million unique words. They decide to replace the standard output layer, which calculates a probability for every single word, with an architecture that organizes words into a binary tree. In this new setup, the probability of a target word is calculated by multiplying the probabilities of the choices made at each node along the path from the tree's root to the word's specific leaf. What is the most likely trade-of

An engineering team is building a language model with a vocabulary size of 1,048,576 words (which is 2^20) and a hidden layer size of 512. During each training step, for a given input context, the model must compute the probability distribution over the entire vocabulary to calculate the loss for the target word. The team is comparing two output layer architectures: a standard layer that computes a score for every word, and a hierarchical layer that uses a balanced binary tree structure.

For a single training example, analyze and contrast the approximate number of computations required by each architecture to determine the probability of the correct target word. Explain which approach is more computationally efficient and why.
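One way to approach the comparison is to count multiply-add operations for the dot products only. The sketch below uses the vocabulary and hidden sizes stated in the scenario; the assumptions that each word score or tree node costs one dot product of length 512, and that exponentials, normalization, and memory effects are ignored, are simplifications made for this estimate.

```python
import math

vocab_size = 1_048_576  # 2**20, from the scenario
hidden_size = 512       # from the scenario

# Standard layer: one dot product of length hidden_size per vocabulary word.
flat_ops = vocab_size * hidden_size

# Balanced binary tree: one dot product per node on the root-to-leaf path.
tree_depth = int(math.log2(vocab_size))
tree_ops = tree_depth * hidden_size

speedup = flat_ops / tree_ops

print(f"standard layer:     {flat_ops:,} multiply-adds")
print(f"hierarchical layer: {tree_ops:,} multiply-adds")
print(f"ratio:              {speedup:,.1f}x")
```

Under these assumptions the standard layer needs 536,870,912 multiply-adds per example, while the tree needs 20 × 512 = 10,240, a ratio of about 52,429. The gap comes from the tree evaluating only the 20 nodes on the target word's path instead of scoring all 1,048,576 words.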
