Learn Before
Log-Likelihood Gradient
What makes learning undirected models by maximum likelihood particularly difficult is that the partition function depends on the parameters:
∇θ log p(x; θ) = ∇θ log p̃(x; θ) − ∇θ log Z(θ)
For the gradient of the log partition function, the following identity holds under certain regularity conditions:
∇θ log Z = Ex∼p(x) [∇θ log p̃(x)]
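This identity can be derived in a few lines (assuming a discrete domain, p̃(x) > 0 for all x, and conditions that allow exchanging the gradient with the sum):

```latex
\begin{aligned}
\nabla_\theta \log Z
  &= \frac{\nabla_\theta Z}{Z}
   = \frac{1}{Z} \nabla_\theta \sum_x \tilde{p}(x)
   = \frac{1}{Z} \sum_x \nabla_\theta \tilde{p}(x) \\
  &= \sum_x \frac{\tilde{p}(x)}{Z} \, \nabla_\theta \log \tilde{p}(x)
   = \sum_x p(x) \, \nabla_\theta \log \tilde{p}(x)
   = \mathbb{E}_{x \sim p(x)}\!\left[\nabla_\theta \log \tilde{p}(x)\right]
\end{aligned}
```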
The Monte Carlo approach to learning undirected models provides an intuitive framework in which we can consider both the positive and the negative phase of the gradient.
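The positive/negative-phase decomposition can be sketched numerically on a toy model. Everything below (the 4-state model with log p̃(x; θ) = θ[x], the made-up data, and the exact sampler) is an illustrative assumption, not part of the source; in practice the negative-phase samples would come from MCMC, precisely because exact sampling requires the Z we are trying to avoid.

```python
import numpy as np

# Toy energy-based model (assumed setup): 4 discrete states,
# log p~(x; theta) = theta[x], so grad_theta log p~(x) is one-hot at x.
rng = np.random.default_rng(0)
theta = np.array([0.5, -0.2, 0.1, 0.3])
p = np.exp(theta) / np.exp(theta).sum()  # model distribution p(x; theta)

# Negative phase, exact: E_{x~p}[grad log p~(x)] equals p itself here.
exact_neg_phase = p

# Negative phase, Monte Carlo: average one-hot gradients over model samples.
# (We cheat and sample exactly; a real implementation would use MCMC.)
n = 100_000
samples = rng.choice(len(theta), size=n, p=p)
mc_neg_phase = np.bincount(samples, minlength=len(theta)) / n

# Positive phase: average grad log p~ over (hypothetical) observed data.
data = np.array([0, 0, 2, 3])
pos_phase = np.bincount(data, minlength=len(theta)) / len(data)

# Monte Carlo estimate of the log-likelihood gradient: positive - negative.
grad_estimate = pos_phase - mc_neg_phase
```

With 100,000 samples the Monte Carlo negative phase closely tracks the exact expectation, which is what makes the sampling-based view of learning workable.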
Tags
Data Science
Related
Pseudolikelihood
Log-Likelihood Gradient
Normalizing Model Outputs
A model produces unnormalized scores for three possible outcomes: {Outcome A: 8, Outcome B: 10, Outcome C: 2}. To convert these scores into a valid probability distribution, a normalization constant must be calculated by summing all the unnormalized scores. What is the final, normalized probability for Outcome B?
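The arithmetic the question asks for can be checked directly (a short sketch; the sum-then-divide rule is the one stated in the question itself):

```python
# Normalize unnormalized scores by their sum to get a probability distribution.
scores = {"A": 8, "B": 10, "C": 2}
Z = sum(scores.values())  # normalization constant: 8 + 10 + 2 = 20
probs = {k: v / Z for k, v in scores.items()}
# Outcome B: 10 / 20 = 0.5
```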
Computational Cost of Normalization