Case Study

Computational Cost of Output Architectures

An engineering team is building a language model with a vocabulary of 1,048,576 words (2^20) and a hidden layer size of 512. During each training step, for a given input context, the model must compute the probability of the target word, normalized over the entire vocabulary, in order to calculate the loss. The team is comparing two output layer architectures: a standard (flat) layer that computes a score for every word in the vocabulary, and a hierarchical layer that arranges the words as the leaves of a balanced binary tree.

For a single training example, analyze and contrast the approximate number of computations required by each architecture to determine the probability of the correct target word. Explain which approach is more computationally efficient and why.
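
As a starting point for the comparison, the two costs can be estimated with a rough back-of-the-envelope sketch. It assumes that each score is one dot product with the hidden vector (so it costs about `hidden_size` multiply-adds), that the tree has one binary decision per level, and that bias terms, the exponentials, and the normalization sum are ignored. The function names are hypothetical and chosen only for illustration.

```python
def standard_layer_cost(vocab_size: int, hidden_size: int) -> int:
    """Approximate multiply-adds to score every word in the vocabulary.

    Assumption: one dot product of length hidden_size per word.
    """
    return vocab_size * hidden_size


def hierarchical_layer_cost(vocab_size: int, hidden_size: int) -> int:
    """Approximate multiply-adds along one root-to-leaf path.

    Assumption: a balanced binary tree with one dot product per level.
    """
    # Depth of a balanced binary tree with vocab_size leaves: ceil(log2(V)).
    depth = (vocab_size - 1).bit_length()
    return depth * hidden_size


V = 2 ** 20  # vocabulary size from the case study
d = 512      # hidden layer size from the case study

flat = standard_layer_cost(V, d)
tree = hierarchical_layer_cost(V, d)

print(f"standard layer:     {flat:,} multiply-adds")
print(f"hierarchical layer: {tree:,} multiply-adds")
print(f"ratio:              about {flat // tree:,}x")
```

Under these assumptions the standard layer needs roughly 2^20 × 512 ≈ 5.4 × 10^8 multiply-adds, while the hierarchical layer needs about 20 × 512 = 10,240, because only the log2(V) = 20 decisions on the path to the target word are evaluated. Actual savings depend on hardware and implementation details that this count does not capture.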

Updated 2025-10-03

Tags

Data Science

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science