1Cademy - A language model uses the following formula to calculate the probability of a specific word `j`, which belongs to partition `u` from a set of `n_u` partitions of the vocabulary: `α_{i,j} = \frac{\exp(\beta_{i,j})}{\sum_{\mathbf{k}_{j'} \in \mathbf{K}^{[1]}} \exp(\beta_{i,j'}) + \cdots + \sum_{\mathbf{k}_{j'} \in \mathbf{K}^{[u]}} \exp(\beta_{i,j'}) + \cdots + \sum_{\mathbf{k}_{j'} \in \mathbf{K}^{[n_u]}} \exp(\beta_{i,j'})}` Based on the structure of this formula, what is a key characteristic of its normalization term (the denominator)?

Learn Before

Hierarchical Softmax Formula

Multiple Choice

A language model uses the following formula to calculate the probability of a specific word j, which belongs to partition u from a set of n_u partitions of the vocabulary:

α_{i,j} = \frac{\exp(\beta_{i,j})}{\sum_{\mathbf{k}_{j'} \in \mathbf{K}^{[1]}} \exp(\beta_{i,j'}) + \cdots + \sum_{\mathbf{k}_{j'} \in \mathbf{K}^{[u]}} \exp(\beta_{i,j'}) + \cdots + \sum_{\mathbf{k}_{j'} \in \mathbf{K}^{[n_u]}} \exp(\beta_{i,j'})}
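
To make the structure of the denominator concrete, the formula can be evaluated numerically. The following is a minimal sketch, not a reference implementation: the function names `partitioned_softmax` and `flat_softmax`, the example scores, and the example partitions are all hypothetical, and it assumes the partitions are disjoint and together cover the whole vocabulary.

```python
import math


def partitioned_softmax(scores, partitions, j):
    """Probability of word j, with the denominator written partition by partition.

    scores[k] plays the role of beta_{i,k} for a fixed position i (assumed values).
    partitions is a list of index lists standing in for K^[1], ..., K^[n_u].
    """
    # One inner sum per partition, then all partition sums are added together.
    denominator = sum(
        sum(math.exp(scores[k]) for k in part)
        for part in partitions
    )
    return math.exp(scores[j]) / denominator


def flat_softmax(scores, j):
    """Ordinary softmax over the full vocabulary, for comparison."""
    return math.exp(scores[j]) / sum(math.exp(s) for s in scores)


# Hypothetical scores for a six-word vocabulary split into three partitions.
scores = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]
partitions = [[0, 1], [2, 3, 4], [5]]

probs = [partitioned_softmax(scores, partitions, j) for j in range(len(scores))]
print(sum(probs))  # total probability mass across the vocabulary
for j in range(len(scores)):
    print(j, probs[j], flat_softmax(scores, j))
```

Under these assumptions, the two functions can be compared word by word, which shows how the partition-wise sums in the denominator relate to a single sum over the vocabulary.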

Based on the structure of this formula, what is a key characteristic of its normalization term (the denominator)?

Updated 2025-09-26
