Concept

Using Bayes' theorem in LDA

According to Bayes' theorem, we have

$$
P(Y=k \mid X=x) = \frac{P(Y=k)\, P(X=x \mid Y=k)}{\sum_{l=1}^{K} P(Y=l)\, P(X=x \mid Y=l)} = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)},
$$

where $\pi_k = P(Y=k)$ is the prior probability of class $k$ and $f_k(x) = P(X=x \mid Y=k)$ is its class-conditional density.

LDA assumes that every class follows a Gaussian distribution with a shared covariance matrix $\boldsymbol{\Sigma}$, so the class-conditional density is

$$
f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_k)^{T} \boldsymbol{\Sigma}^{-1} (x-\mu_k)\right)
$$

for a $p$-dimensional random vector $X$ with distribution $N(\mu_k, \boldsymbol{\Sigma})$. All we need to do is plug $f_k(x)$ into Bayes' theorem above and find the $k$ that maximizes $P(Y=k \mid X=x)$. LDA is therefore a Bayes classifier under the assumption that every class is Gaussian.
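This plug-in rule can be sketched in a few lines of NumPy. The sketch below (function and variable names are illustrative, not from any particular library) evaluates $f_k(x)$ for each class, weights by the priors $\pi_k$, normalizes to get the posteriors, and predicts the class with the largest posterior:

```python
import numpy as np

def lda_posteriors(x, means, cov, priors):
    """Posterior P(Y=k | X=x) under shared-covariance Gaussians (LDA).

    x      : (p,)  observation
    means  : (K,p) class means mu_k
    cov    : (p,p) shared covariance Sigma
    priors : (K,)  class priors pi_k
    """
    p = x.shape[0]
    cov_inv = np.linalg.inv(cov)
    # Normalizing constant 1 / ((2*pi)^{p/2} |Sigma|^{1/2})
    norm = 1.0 / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(cov)))
    diffs = means - x                                   # (K, p)
    # Quadratic form (x - mu_k)^T Sigma^{-1} (x - mu_k) for each k
    quad = np.einsum('ki,ij,kj->k', diffs, cov_inv, diffs)
    dens = norm * np.exp(-0.5 * quad)                   # f_k(x)
    unnorm = priors * dens                              # pi_k f_k(x)
    return unnorm / unnorm.sum()                        # Bayes' theorem

# Toy example: two 2-D classes with a shared identity covariance
means = np.array([[0.0, 0.0], [2.0, 2.0]])
cov = np.eye(2)
priors = np.array([0.5, 0.5])
post = lda_posteriors(np.array([0.1, -0.2]), means, cov, priors)
pred = int(np.argmax(post))   # class k maximizing P(Y=k | X=x)
```

In practice one works with the log of $\pi_k f_k(x)$ (the linear discriminant function) instead, since the shared normalizing constant cancels and exponentials of large quadratic forms underflow; the argmax is the same.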


Updated 2020-03-07

Tags

Data Science