Learn Before
Concept

Random Forests vs. Bagging with Decision Trees

Random forests are quite similar to bagging with decision trees, except for an improvement in how trees are split. Both methods fit each tree on a bootstrapped subset of the training observations, repeat that process many times, and then average the predictions (or take the most common class).
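The shared first step, drawing a bootstrapped training set, can be sketched as below; this is a minimal illustration using only the standard library, not any particular library's implementation:

```python
import random

def bootstrap_sample(n, rng):
    """Indices of one bootstrapped training set: n draws with replacement."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(0)
sample = bootstrap_sample(10, rng)
print(sample)  # duplicates are expected; on average ~63% of unique rows appear
```

Each tree in either ensemble is grown on one such sample, and the procedure is repeated for as many trees as desired.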

The difference is that a random forest considers only a randomly selected subset of all predictors at each split. This decorrelates the trees and addresses a weakness of bagging: when there is one very strong predictor, most bagged trees look similar, and averaging highly correlated trees does not reduce variance by much. By decorrelating the trees, a random forest reduces the variance of the ensemble and makes its predictions more reliable.


Updated 2020-03-07

Tags

Data Science