1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.
3. Use K-fold cross-validation to choose α. That is, divide the training observations into K folds. For each k = 1, ..., K:
   (a) Repeat Steps 1 and 2 on all but the kth fold of the training data.
   (b) Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of α.

   Average the results for each value of α, and pick α to minimize the average error.

4. Return the subtree from Step 2 that corresponds to the chosen value of α.
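
For reference, the criterion behind the cost complexity pruning in Step 2 can be written as follows. This follows the formulation in James et al. (2013); the symbols $$T_0$$ (the large tree from Step 1), $$|T|$$ (the number of terminal nodes of a subtree $$T$$), and $$R_m$$ (the region of the $$m$$th terminal node) are introduced here and do not appear in the steps above. For each value of α ≥ 0, the selected subtree $$T \subset T_0$$ is the one that minimizes

$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|$$

With α = 0 the penalty vanishes and the full tree $$T_0$$ is selected; as α increases, each terminal node costs more, so the minimizing subtree becomes smaller.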

Building a regression tree

A regression tree divides the predictor space (the set of possible values of $$X_1, X_2, \dots, X_p$$) into $$J$$ distinct and non-overlapping regions $$R_1, R_2, \dots, R_J$$. For each observation that falls into a given region $$R_j$$, the model makes the same prediction, which is the mean of the response values for the training observations in $$R_j$$. The goal is to construct regions that minimize the residual sum of squares (RSS): $$\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2$$, where $$\hat{y}_{R_j}$$ is the mean response of the training observations in $$R_j$$.
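
As a rough illustration of the RSS criterion, the sketch below computes the RSS of a partition and searches for the single cutpoint on one predictor that minimizes it. The function names (`rss`, `best_split`), the midpoint candidates, and the toy data are assumptions made for this example; it covers one split on one predictor, not a full tree-growing procedure.

```python
from statistics import mean

def rss(groups):
    """RSS when every group (region) is predicted by its own mean response."""
    total = 0.0
    for ys in groups:
        if ys:  # an empty region contributes nothing
            y_hat = mean(ys)
            total += sum((y - y_hat) ** 2 for y in ys)
    return total

def best_split(x, y):
    """Return (cutpoint, rss) for the split x < s with the smallest RSS.

    Candidate cutpoints are midpoints between consecutive distinct x values
    (an assumption of this sketch).
    """
    values = sorted(set(x))
    best = (None, float("inf"))
    for lo, hi in zip(values, values[1:]):
        s = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < s]
        right = [yi for xi, yi in zip(x, y) if xi >= s]
        total = rss([left, right])
        if total < best[1]:
            best = (s, total)
    return best

# Toy data, invented for illustration.
x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

rss_no_split = rss([y])        # one region containing every observation
cut, rss_split = best_split(x, y)
print(cut, rss_split, rss_no_split)
```

On this toy data the chosen cutpoint separates the two clusters of x values, and the RSS drops sharply compared with predicting a single overall mean.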

University of Michigan - Ann Arbor

Google

Decision trees can be applied to both regression and classification problems, and can thus be divided into:

- Regression trees
- Classification trees

Decision trees applied to regression and classification problems

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112, pp. 3-7). New York: Springer.

An Introduction to Statistical Learning with Applications in R

Roughly speaking, there are two steps.
1. We divide the predictor space (that is, the set of possible values for $$X_1, X_2, \dots, X_p$$) into $$J$$ distinct and non-overlapping regions, $$R_1, R_2, \dots, R_J$$.
2. For every observation that falls into the region $$R_j$$, we make the same prediction, which is simply the mean of the response values for the training observations in $$R_j$$.
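
The two steps above can be sketched in a few lines. This is a minimal illustration, assuming the regions are already given: the cutpoints in `region_of`, the helper names, and the training data are all hypothetical and chosen only to make the example concrete.

```python
from statistics import mean

def region_of(x1, x2):
    """Step 1 (assumed already done): map a point to one of three
    non-overlapping regions. The cutpoints are hypothetical."""
    if x1 < 4.5:
        return "R1"
    if x2 < 117.5:
        return "R2"
    return "R3"

def fit_region_means(X, y):
    """Step 2: the prediction for each region is the mean training response in it."""
    by_region = {}
    for (x1, x2), yi in zip(X, y):
        by_region.setdefault(region_of(x1, x2), []).append(yi)
    return {region: mean(ys) for region, ys in by_region.items()}

def predict(region_means, x1, x2):
    """Every observation falling in the same region gets the same prediction."""
    return region_means[region_of(x1, x2)]

# Toy training data, invented for illustration.
X_train = [(2, 100), (3, 150), (6, 90), (7, 110), (8, 130), (9, 160)]
y_train = [5.0, 5.4, 5.9, 6.1, 6.6, 6.8]

region_means = fit_region_means(X_train, y_train)
print(region_means)
print(predict(region_means, 5, 100))  # falls in R2, so it gets R2's mean
```

Note that the prediction depends only on which region a point lands in, not on where inside the region it sits.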

Steps of creating a decision tree

1. Overfitting - Trees can become overly complex and then fail to generalize to new data. This is mitigated by constraining model parameters (such as tree depth or minimum node size) and by pruning.
2. Poorly suited to continuous variables - Splitting a continuous numerical variable at cutpoints bins its values into categories, which loses information.
3. Generally lower predictive accuracy than other ML algorithms.
4. Potentially unstable - Small variations in the data can result in a completely different tree being generated. This is called high variance, and it can be reduced by ensemble methods like bagging and boosting.

Source: [medium.com/greyatom/decision-trees-a-simple-way-to-visualize-a-decision-dc506a403aeb](https://medium.com/greyatom/decision-trees-a-simple-way-to-visualize-a-decision-dc506a403aeb)

Decision Tree Disadvantages

Regression Tree

A classification tree predicts a qualitative response rather than a quantitative one. Like a regression tree, it is grown by recursive binary splitting. However, splits are evaluated with the Gini index ($$G = \sum_{k=1}^{K} \hat{p}_{mk} (1-\hat{p}_{mk})$$) or the entropy ($$D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$$), where $$\hat{p}_{mk}$$ is the proportion of training observations in the $$m$$th node that belong to the $$k$$th class. Both measures take small values when node purity is high, meaning the node contains observations predominantly from a single class. If the $$m$$th node is completely pure, both its entropy and its Gini index equal zero.
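
To make the two purity measures concrete, here is a small sketch that evaluates them for the class labels in a single node. The function names and the example labels are assumptions for illustration, and the natural logarithm is used because the text does not specify a base.

```python
from collections import Counter
from math import log

def class_proportions(labels):
    """Proportion of the node's observations in each class that is present."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """Gini index: sum over classes of p * (1 - p)."""
    return sum(p * (1 - p) for p in class_proportions(labels))

def entropy(labels):
    """Entropy: -sum over classes of p * log(p), using the natural log.

    Classes absent from the node are skipped, which matches the usual
    convention that 0 * log(0) is treated as 0.
    """
    return -sum(p * log(p) for p in class_proportions(labels))

# Hypothetical nodes, invented for illustration.
pure_node = ["yes"] * 10
mixed_node = ["yes"] * 5 + ["no"] * 5

print(gini(pure_node), entropy(pure_node))
print(gini(mixed_node), entropy(mixed_node))
```

A node containing a single class scores zero on both measures, while an evenly mixed two-class node gives the largest values possible for two classes.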
