Consider a binary classification problem with 29 positive labels and 35 negative labels. You are building a decision tree and need to decide whether to split on feature A1 or A2. Given the following information, which feature is the better choice? As an exercise, calculate the information gain (IG) of each candidate split.
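The per-feature split counts for A1 and A2 are not reproduced here, so the sketch below covers only the first ingredient of the exercise: the entropy of the root node with 29 positive and 35 negative labels. The helper name `entropy` and the choice of base-2 logarithms (entropy in bits) are assumptions made for illustration.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a label distribution given as class counts."""
    total = sum(counts)
    # Classes with zero count contribute nothing, so skip them to avoid log2(0).
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Root node of the exercise: 29 positive labels, 35 negative labels.
root_entropy = entropy([29, 35])
print(f"H(Y) = {root_entropy:.4f} bits")
```

The result is close to 1 bit because the two classes are nearly balanced. The information gain of either feature is this value minus the weighted entropy of the child nodes that the split produces.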


How should we split?

Information gain selects the split feature that appears most relevant to the label, but it does not account for the number of distinct values that feature can take. If a feature takes so many distinct values that it partitions the training data into very small, or even single-example, subsets, it will have high information gain, yet splitting on it can lead to overfitting.
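A minimal sketch of this failure mode, on made-up toy data: a feature that is unique per row (here a hypothetical `row_id`) leaves every subset pure, so its information gain equals the full label entropy even though it cannot generalize. All names and values below are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG = H(Y) - H(Y | X), estimated from paired feature/label lists."""
    groups = defaultdict(list)
    for x, y in zip(feature, labels):
        groups[x].append(y)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical toy data, not taken from the exercise.
labels = [1, 1, 0, 0, 1, 0, 1, 0]
row_id = list(range(len(labels)))                  # one distinct value per example
useful = ["a", "a", "b", "b", "a", "b", "b", "a"]  # two values, imperfect predictor

ig_id = information_gain(row_id, labels)
ig_useful = information_gain(useful, labels)
print(f"IG(row_id) = {ig_id:.3f}, IG(useful) = {ig_useful:.3f}")
```

Under this criterion the identifier looks like the best possible split, which is exactly the bias toward many-valued features described above.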

Disadvantage of Information Gain in Decision Trees


We define information gain as $$IG(X,Y) = H(Y) - H(Y\mid X)$$. 

We interpret information gain as the amount of information we learn about the label $$Y$$, given a specific feature $$X$$. 

Note that if the feature $$X$$ is independent of the label $$Y$$ (that is, it carries no information about $$Y$$), then $$H(Y\mid X) = H(Y)$$. Therefore $$IG(X,Y) = 0$$. 

As such, when choosing the next feature to split on, we should choose the one that maximizes information gain. 
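The selection rule can be sketched as follows, assuming each feature is given as a list of discrete values aligned with the labels. The feature names and toy values are hypothetical; the `independent` feature is constructed so that its information gain is exactly zero.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG(X, Y) = H(Y) - H(Y | X) for one discrete feature."""
    groups = defaultdict(list)
    for x, y in zip(feature, labels):
        groups[x].append(y)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical toy data.
labels = [1, 1, 0, 0]
features = {
    "informative": [1, 1, 0, 0],  # determines the label exactly
    "independent": [0, 1, 0, 1],  # label distribution is the same for each value
}

gains = {name: information_gain(col, labels) for name, col in features.items()}
best = max(gains, key=gains.get)  # feature with the largest information gain
print(gains, "-> split on", best)
```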


University of Michigan - Ann Arbor

Conditional entropy quantifies the remaining uncertainty in a label $$Y$$ after observing a feature $$X$$. Let $$Y$$ be the set of possible labels and $$X$$ be the set of possible feature values.

For a fixed feature value $$X=x$$, the conditional entropy is $$H(Y \mid X=x) = -\sum_{y\in Y} \Pr(Y=y \mid X=x)\log \Pr(Y=y \mid X=x)$$.

Averaging over all values of $$X$$ gives the overall conditional entropy $$H(Y\mid X) = \sum_{x\in X} \Pr(X=x)\, H(Y\mid X=x)$$, which is the quantity used when computing information gain.

Intuitively: how uncertain are we about a person's health outcome, given that we know how many times a year they visit the doctor?
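To make the doctor-visit intuition concrete, here is a small sketch that computes $$H(Y \mid X=x)$$ for each feature value and then the weighted average $$H(Y\mid X)$$. The counts are invented purely for illustration, and base-2 logarithms are assumed.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical counts: doctor visits per year -> (healthy, unhealthy).
counts_by_visits = {
    "0-1": (40, 10),
    "2-5": (15, 15),
    "6+": (4, 16),
}

total = sum(sum(c) for c in counts_by_visits.values())

# H(Y | X = x): entropy of the outcome within each visit group.
per_value = {x: entropy(c) for x, c in counts_by_visits.items()}

# H(Y | X): per-group entropies weighted by Pr(X = x).
conditional_entropy = sum(
    sum(c) / total * per_value[x] for x, c in counts_by_visits.items()
)

# H(Y): entropy of the outcome when visit frequency is ignored.
label_totals = [sum(col) for col in zip(*counts_by_visits.values())]
marginal_entropy = entropy(label_totals)

print(f"H(Y) = {marginal_entropy:.3f}, H(Y|X) = {conditional_entropy:.3f}")
```

The gap between the two printed values is the information gain of the visit-frequency feature on this toy table.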

Conditional Entropy

https://github.com/eecs445-f16/umich-eecs445-f16/blob/master/lecture11_info-theory-decision-trees/lecture11_info-theory-decision-trees.pdf
