Learn Before
Computational Cost of Output Architectures
An engineering team is building a language model with a vocabulary of 1,048,576 words (2^20) and a hidden layer of size 512. During each training step, for a given input context, the model must compute the probability of the target word to calculate the loss. The team is comparing two output-layer architectures: a standard softmax layer that computes a score for every word in the vocabulary, and a hierarchical layer that organizes the vocabulary as a balanced binary tree.
For a single training example, analyze and contrast the approximate number of computations required by each architecture to determine the probability of the correct target word. Explain which approach is more computationally efficient and why.
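A minimal sketch of the comparison, using the sizes given in the question (V = 2^20, H = 512) and counting one length-H dot product per scored unit; the variable names are illustrative, not from any particular framework:

```python
import math

V = 2 ** 20   # vocabulary size: 1,048,576 words
H = 512       # hidden layer size

# Standard softmax: one dot product of length H per vocabulary word,
# so roughly V * H multiply-adds to score the whole vocabulary.
standard_ops = V * H

# Hierarchical softmax over a balanced binary tree: one dot product of
# length H per internal node on the root-to-leaf path, which is
# log2(V) = 20 nodes deep for a balanced tree over 2^20 leaves.
tree_depth = int(math.log2(V))
hierarchical_ops = tree_depth * H

speedup = standard_ops / hierarchical_ops
print(standard_ops)      # 536870912  (~5.4e8 multiply-adds)
print(hierarchical_ops)  # 10240
print(round(speedup))    # 52429
```

Under these assumptions the hierarchical layer needs about 10 thousand operations per example versus roughly half a billion for the standard layer, a factor of about 50,000, which is why hierarchical output layers are the more efficient choice for very large vocabularies.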
Tags
Data Science
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Hierarchical Softmax Formula
A machine learning team is training a language model with a vocabulary of over one million unique words. They decide to replace the standard output layer, which calculates a probability for every single word, with an architecture that organizes words into a binary tree. In this new setup, the probability of a target word is calculated by multiplying the probabilities of the choices made at each node along the path from the tree's root to the word's specific leaf. What is the most likely trade-off the team will face by making this change?
Computational Cost of Output Architectures
Probability Calculation in a Hierarchical Output Layer
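The path-probability rule described in the related question above (multiplying the branch probabilities from root to leaf) can be sketched as follows; the node scores and branch directions are hypothetical placeholders, since in a real model each score would come from a dot product between the hidden state and that node's parameter vector:

```python
import math

def sigmoid(x: float) -> float:
    """Probability of taking one branch at an internal tree node."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical scores at each internal node on a root-to-leaf path.
node_scores = [2.0, -1.5, 0.7]
# Direction taken at each node: +1 for one child, -1 for the other.
directions = [+1, -1, +1]

# P(word) = product over path nodes of sigmoid(direction * score);
# choosing the opposite child at a node flips the sign of its score.
p_word = 1.0
for score, d in zip(node_scores, directions):
    p_word *= sigmoid(d * score)

print(p_word)  # a valid probability strictly between 0 and 1
```

Because each sigmoid gives the probability of one branch and its complement goes to the sibling branch, the leaf probabilities over the whole tree sum to 1 without ever normalizing over the full vocabulary; the trade-off is added implementation complexity and a dependence on how well the tree is built.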