A machine learning model is designed to predict the next word in a sentence and has a vocabulary of 200,000 possible words. The model produces an unnormalized score for each of these words. Explain why calculating the exact normalization constant (the partition function) to turn these scores into a valid probability distribution is a significant computational challenge.

Google

The partition function is an integral or sum over unnormalized probabilities. This function, often denoted by Z, is used to normalize an unnormalized distribution into a valid distribution.
$$ p(\textbf{x}) =  \frac{1}{Z}  \tilde{p}(\textbf{x}) .  $$

where Z is: 

$$ Z =  \int   \tilde{p}(\textbf{x}) d\textbf{x}.  $$ 
or
$$ Z =  \sum_{x}  \tilde{p}(\textbf{x}) d\textbf{x}.  $$

Z = partition function.  
$$ \tilde{p}(\textbf{x})$$ = Unnormalized distribution. 
$$p(\textbf{x}) $$= Normalized distribution. 
In the context of deep learning, Z is usually intractable, meaning that it often needs to be approximated. 


The Partition Function (introduction).

The gradient of the log-likelihood which is used in Maximum Likelihood Estimation can be decomposed into the following:
$$\nabla_{\boldsymbol \theta} \log p(\mathbf{x}; \boldsymbol \theta) = \nabla_{\boldsymbol \theta} \log \tilde{p}(\mathbf{x}; \boldsymbol \theta) - \nabla_{\boldsymbol \theta } \log Z(\boldsymbol \theta)$$
Where $\tilde{p}(\mathbf{x}; \boldsymbol \theta)$ is the unnormalized probabilioty density and $Z$ is the partition function. This is well-known as the decomposition into the positive phase and negative phase of learning. Due to the reliance of the partition function on the parameters, learning models by maximum likelihood is particularly difficult. 

Log-Likelihood Gradient

What makes learning undirected models by maximum likelihood particularly diﬃcult is that the partition function depends on the parameters：

         ∇θlog p(x; θ) = ∇θlog ˜p(x; θ) − ∇θlog Z(θ)


This identity is applicable only under certain regularity conditions :

         ∇θlog Z = Ex∼p(x)∇θlog ˜p(x) 


The Monte Carlo approach to learning undirected models provides an intuitive framework in which we can consider both positive and negative phases

Given the scenario below, calculate the value of the normalizing constant required to turn the scores into a valid probability distribution, and then determine the final probability for 'Outcome A'.

Normalizing Model Outputs

A model produces unnormalized scores for three possible outcomes: {Outcome A: 8, Outcome B: 10, Outcome C: 2}. To convert these scores into a valid probability distribution, a normalization constant must be calculated by summing all the unnormalized scores. What is the final, normalized probability for Outcome B?

Learn Before

Related