Below is a representation of underfitting and overfitting using the boundaries of U.S. states. The data comes from the US Census Bureau. In its original format, the data is a single Keyhole Markup Language (KML) file that contains the latitude and longitude coordinates of the borders of US states. The necessary latitude, longitude, and label (state) data were parsed from the KML file using a simple Python script. The main idea here is to understand the bias-variance trade-off and how it relates to underfitting and overfitting.

Additionally, there are two examples of ways to avoid underfitting and overfitting and create a much more accurate map using (gradient) boosting and random forest classifiers.

University of Michigan - Ann Arbor

The Random Forest is a model made up of many decision trees. It trains each tree on a slightly different set of the observations and splits the nodes in each tree considering only a limited number of the features. The final predictions of the random forest are made by averaging the predictions of the individual trees. In the random forest algorithm, the individual "weak" decision trees do not need to be pruned.

Advantages of Random Forest:
- Less overfitting
- Parallel implementation

Random Forest

Boosting is another approach that can be used for both classification and regression. Boosting improves the predictions of a decision tree by learning slowly. Unlike bagging, boosting works sequentially: each tree is grown using information from the trees created before it and is fit to a modified version of the original data set. Boosting combines a large number of trees ($\hat{f}^{1}, \dots, \hat{f}^{B}$). Each decision tree is fitted to the current residuals of the model rather than to the response ($Y$); the new tree is then added to the fitted function and the residuals are updated. This means that each new tree slowly reduces the error left by the previous ones, improving $\hat{f}$ particularly in the regions where it does not yet perform well. The trees used in boosting tend to be very small.
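The residual-fitting loop described above can be sketched in a few lines. This is a minimal illustration for regression, not a reference implementation: the function names `boost_regression_trees` and `boost_predict`, the synthetic sine data, and the default values (200 stumps, shrinkage 0.1) are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def boost_regression_trees(X, y, n_trees=200, shrinkage=0.1, max_depth=1):
    """Fit small trees one after another, each to the current residuals."""
    residuals = y.astype(float).copy()  # the fitted function starts at 0, so r = y
    trees = []
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                    # fit to residuals, not to y
        residuals -= shrinkage * tree.predict(X)  # update the residuals
        trees.append(tree)
    return trees


def boost_predict(trees, X, shrinkage=0.1):
    """The boosted model is the shrunken sum of all fitted trees."""
    return shrinkage * sum(tree.predict(X) for tree in trees)


# Hypothetical toy data: a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

trees = boost_regression_trees(X, y)
mse_few = np.mean((y - boost_predict(trees[:10], X)) ** 2)
mse_all = np.mean((y - boost_predict(trees, X)) ** 2)
print(f"training MSE after 10 trees: {mse_few:.4f}, after 200 trees: {mse_all:.4f}")
```

Each stump on its own is a weak model; the training error falls only because every new stump is fit to what the previous ones left unexplained.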


Boosting

When we use a machine learning algorithm, we sample the training set, use it to choose the parameters that reduce the training set error, and then sample the test set. Under this process, the expected test error is greater than or equal to the expected training error.

The factors determining how well a machine learning algorithm will perform are its ability to:
1. Make the training error small.
2. Make the gap between training and test error small.

These two factors correspond to the two central challenges in machine learning: underfitting and overfitting.


Factors determining how well a machine learning algorithm will perform

Sharing concepts, ideas and codes.
https://towardsdatascience.com/

Towards Data Science

Decision Trees are **non-parametric**, since we are not making any assumptions about the underlying function $$f$$. Consequently, decision trees have very low structural error (bias). By averaging the output of many trees in a random forest, we are lowering the variance of the model, while keeping the bias the same. 


Why Random Forests?

*Random forests* are quite similar to *bagging* with *decision trees*, except for an improvement in how the trees are split. Both fit each tree to a bootstrapped subset of the training observations and repeat that process many times, taking the average or most common result.

The difference is that a random forest considers only a randomly selected subset of the predictors when splitting. This improvement *decorrelates* the trees and solves a problem with bagging: if there is one very strong predictor, most bagged trees look similar, and averaging highly correlated quantities does not reduce the variance much. Decorrelating the trees reduces the variance of the average and makes the resulting ensemble more reliable.
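The effect of decorrelation can be made precise under a simplifying assumption that is not in the original notes: suppose each of the $B$ trees has the same variance $\sigma^2$ and every pair of trees has the same correlation $\rho$. The variance of their average is then

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}^{b}(x)\right) = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}.
$$

As $B$ grows, the second term vanishes but the first does not, so adding more trees cannot push the variance below $\rho\sigma^{2}$. Lowering $\rho$, which is what the random choice of predictors aims at, is the only way to reduce that floor.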

Random Forests vs. Bagging with Decision Trees

A study of different data sizes and different forest sizes (number of trees) found that beyond a certain point, increasing the forest size becomes much more computationally expensive without any significant performance gain.

The study found that between 64 and 128 trees was optimal: fewer trees were less accurate, and more trees were slower with similar accuracy.
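One way to look for such a plateau on your own data is to sweep the forest size and compare cross-validated accuracy. The sketch below is only an illustration of the procedure: the iris data set and the particular sizes tried are assumptions, and a data set this small will not necessarily reproduce the 64 to 128 range reported by the study.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for several forest sizes.
scores = {}
for n_trees in (1, 8, 64, 128, 512):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    scores[n_trees] = cross_val_score(model, X, y, cv=5).mean()

for n_trees, score in scores.items():
    print(f"{n_trees:4d} trees: accuracy {score:.3f}")
```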

Random Forests: Selecting Number of Trees

Basic algorithm:

1. Pick N random records from the dataset. 

2. Build a decision tree based on these N records. 

3. Choose the number of trees you want in your algorithm and repeat steps 1 and 2. 

4. In case of a regression problem, for a new record, each tree in the forest predicts a value for Y (output). The final value can be calculated by taking the average of all the values predicted by all the trees in the forest. Or, in the case of a classification problem, each tree in the forest predicts the category to which the new record belongs. Finally, the new record is assigned to the category that wins the majority vote. 
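The four steps above can be sketched for the classification case as follows. This is a simplified illustration rather than scikit-learn's own implementation: the helper names `fit_forest` and `predict_forest` are hypothetical, the records are drawn with replacement (a bootstrap sample), the class labels are assumed to be non-negative integers, and `max_features="sqrt"` is one common choice for the random subset of predictors.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=25, seed=0):
    """Steps 1-3: grow n_trees trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # step 1: pick N records with replacement
        tree = DecisionTreeClassifier(
            max_features="sqrt",          # random subset of predictors per split
            random_state=int(rng.integers(1 << 31)),
        )
        forest.append(tree.fit(X[idx], y[idx]))  # step 2: build a tree on them
    return forest


def predict_forest(forest, X):
    """Step 4 (classification): majority vote over the trees."""
    votes = np.stack([tree.predict(X) for tree in forest]).astype(int)
    # votes has shape (n_trees, n_samples); count the votes for each sample
    return np.array([np.bincount(column).argmax() for column in votes.T])


X, y = load_iris(return_X_y=True)
forest = fit_forest(X, y)
predictions = predict_forest(forest, X)
print("training accuracy:", (predictions == y).mean())
```

For a regression problem, the vote in `predict_forest` would be replaced by the mean of the trees' predictions.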

How Does Random Forest Work?

A really good paper that explains random forests very well:
[www.stat.berkeley.edu/~breiman/randomforest2001.pdf](https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf) 

Random Forests

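```python
import pandas as pd
import seaborn as sns
import sklearn.ensemble as skens
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: the train/test frames are not defined in this snippet, so they
# are rebuilt here from seaborn's iris dataset (downloaded on first use), whose
# columns are sepal_length, sepal_width, petal_length, petal_width, species.
df_iris = sns.load_dataset('iris')
df_iris_train, df_iris_test = train_test_split(df_iris, test_size=0.3, random_state=0)
df_iris_test = df_iris_test.copy()  # own copy, so a column can be added below

# build a random forest
rf_model = skens.RandomForestClassifier(n_estimators=10, oob_score=True, criterion='entropy')
# .ix was removed from pandas; .iloc selects the first four (feature) columns
rf_model.fit(df_iris_train.iloc[:, :4], df_iris_train.species)

# predict labels for the test set
predicted_labels = rf_model.predict(df_iris_test.iloc[:, :4])
df_iris_test['predicted_rf_tree'] = predicted_labels

# find accuracy score
accuracy = accuracy_score(df_iris_test.species, predicted_labels)

# utility function to compare the predictions versus ground truth
def comparePlot(input_frame, real_column, predicted_column):
    df_a = input_frame.copy()
    df_b = input_frame.copy()
    df_a['label_source'] = 'Species'
    df_b['label_source'] = 'Classifier'
    df_a['label'] = df_a[real_column]
    df_b['label'] = df_b[predicted_column].apply(lambda x: 'Predict %s' % x)
    df_c = pd.concat((df_a, df_b), axis=0, ignore_index=True)
    # lmplot's `size` argument was renamed to `height` in newer seaborn releases
    sns.lmplot(x='sepal_length', y='sepal_width', col='label_source',
               hue='label', data=df_c, fit_reg=False, height=4)

# compare plot
comparePlot(df_iris_test, "species", "predicted_rf_tree")
```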

Random Forest Python Code

A Visual Look at Under and Overfitting using U.S. States

There are 3 tuning parameters when using Boosting:

1. The number of trees, denoted by $$B$$. Cross-validation is used in order to select $$B$$. If the value of $$B$$ is too big, it may overfit the data, albeit slowly.
2. The shrinkage parameter ($$\lambda$$), which controls the rate at which the boosting learns. This value is typically a small positive value; if it is very small, a large $$B$$ value may be required for it to work well.
3. The number of splits in each tree, denoted by $$d$$. This value controls how complex the boosted ensemble will be. If $$d = 1$$, each tree is a stump: it has a single split and therefore involves at most one variable. $$d$$ is also known as the 'interaction depth', since it controls the interaction order of the model.
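As a hedged illustration, the three parameters can be tuned together by cross-validation with scikit-learn. The mapping used here is an assumption of the example: `n_estimators` plays the role of $$B$$, `learning_rate` the role of $$\lambda$$, and `max_depth` stands in for $$d$$ (it limits tree depth rather than counting splits, so the two agree exactly only for stumps). The iris data and the grid values are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {
    "n_estimators": [50, 200],     # B: number of trees
    "learning_rate": [0.01, 0.1],  # lambda: shrinkage
    "max_depth": [1, 2],           # stand-in for d (depth 1 = stump)
}

# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
```

Because a small $$\lambda$$ usually needs a large $$B$$, searching the two jointly tends to be more informative than tuning them one at a time.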

Tuning Parameters in Boosting

In boosting, each tree depends heavily on the trees formed before it, whereas in bagging each tree is formed independently and varies only because of the bootstrap sample of observations it is trained on.

Boosting vs Bagging

Boosting algorithms are methods that help transform weak learners into stronger learners. There are 3 common boosting algorithms used in data science:

1. AdaBoost (Adaptive Boosting)
2. Gradient Boosting
3. XGBoost (Extreme Gradient Boosting)
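The first two algorithms are available in scikit-learn and can be compared directly, as in the sketch below. The iris data set and the default hyperparameters are assumptions made for illustration. XGBoost lives in the separate `xgboost` package and is left out here so that the example depends only on scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each boosting algorithm.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}

for name, score in scores.items():
    print(f"{name}: accuracy {score:.3f}")
```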

Common Boosting Algorithms

Below is a video that I found helpful in understanding boosting and how it is used with decision trees. The Coursera course the video is part of may be useful as well.

[SAS]. Boosting [Video File]. Retrieved from [www.coursera.org/lecture/machine-learning-sas/boosting-2J0Df](https://www.coursera.org/lecture/machine-learning-sas/boosting-2J0Df) 

Coursera: Boosting with Decision Trees

https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/

what is boosting

A linear function fit to the data suffers from underfitting: it cannot capture the curvature that is present in the data.

A quadratic function fit to the data generalizes well to unseen points. It does not suffer from a significant amount of overfitting or underfitting.

A polynomial of degree 9 fit to the data suffers from overfitting. The solution passes through all the training points exactly, but we have not been lucky enough for it to extract the correct structure.
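The three fits can be reproduced numerically with a small experiment. The setup below is an assumption made for illustration, not the data behind the descriptions above: ten noisy training points drawn from a quadratic, fitted with polynomials of degree 1, 2, and 9, and evaluated on fresh test points.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample(n):
    """Draw n noisy points from the (hypothetical) true function y = x^2."""
    x = rng.uniform(-1, 1, size=n)
    y = x ** 2 + rng.normal(scale=0.05, size=n)
    return x, y


x_train, y_train = sample(10)
x_test, y_test = sample(200)

# errors[degree] = (training MSE, test MSE)
errors = {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    errors[degree] = (train_mse, test_mse)

for degree, (train_mse, test_mse) in errors.items():
    print(f"degree {degree}: train MSE {train_mse:.2e}, test MSE {test_mse:.2e}")
```

Comparing the two columns shows both factors at once: the linear fit cannot make the training error small, while the degree 9 fit makes it tiny but leaves a gap between training and test error.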

