To get accurate predictions from a machine learning model, it is important to avoid both overfitting and underfitting. Achieving this requires balancing the bias and variance of the algorithm. Ensemble bagging, of which random forest is an example, is one way to do this.
Bias: It is the difference between the average prediction of a model and the correct values. An algorithm with high bias pays attention to only certain features and oversimplifies the model.
High Bias -> Underfitting -> High training error + High testing error
Variance: It is a measure of how much the predictions change when the model is fit to a different training dataset. Models with high variance pay too much attention to the training data and do not generalise well to data they have not seen.
High Variance -> Overfitting -> Low training error + High testing error
The ideal model has both a low bias and a low variance.
A trade-off between bias and variance exists when deciding upon the complexity of the model. A very simple model with very few parameters will have a high bias but a low variance; a complex model with many parameters will have a high variance but a low bias.
Ensemble methods are learning models that improve performance by combining the results of multiple models. They can be used to improve the predictions of both classification and regression models.
The main advantage of ensemble methods is that the combined prediction is usually more accurate and more stable (lower variance) than the prediction of any single model.
Ensemble methods used in conjunction with bootstrap resampling are called ensemble bagging (short for bootstrap aggregating).
Deterministic machine learning algorithms are those that give the same results when trained multiple times on the same dataset. In this case, one way to improve performance is to use an ensemble that combines different learning algorithms working on the same dataset. However, the inductive biases of most learning models are highly correlated, so their errors tend to be similar.
Another solution would be to use the same type of model but change the dataset. This has its own problem, as datasets can be hard to come by. Dividing the same dataset to train multiple models means that each model gets trained on a very small subset of the original dataset, which could result in an underfit prediction. Bootstrapping is a way to create multiple datasets from the original without compromising on the amount of data provided to each model.
In bootstrapping, a single dataset $D$ with $N$ training examples is taken. From this dataset, $M$ bootstrap training sets ($D_1, D_2, \dots, D_M$) can be created. Each of these sets also contains $N$ training examples, drawn randomly with replacement from $D$. Each set can now be passed to one of the $M$ models ($M_1, M_2, \dots, M_M$) in the ensemble. These models yield a list of results ($f_1, f_2, \dots, f_M$), from which the final result is obtained by voting (in the case of classification) or by taking the mean (in the case of regression).
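To make this procedure concrete, here is a minimal sketch of bootstrapping and result aggregation in Python. The helper names (bootstrap_sets(), aggregate()) and the toy dataset are purely illustrative and are not part of the random forest code discussed later.

```python
import random

def bootstrap_sets(dataset, m):
    """Create m bootstrap training sets, each of size len(dataset),
    by sampling rows from the original dataset with replacement."""
    n = len(dataset)
    return [[random.choice(dataset) for _ in range(n)] for _ in range(m)]

def aggregate(results, task="classification"):
    """Combine the individual model outputs: majority vote for
    classification, mean for regression."""
    if task == "classification":
        return max(set(results), key=results.count)
    return sum(results) / len(results)

# Example: 3 bootstrap sets drawn from a toy dataset of 5 rows.
D = [[1.0, 0], [2.0, 1], [3.0, 0], [4.0, 1], [5.0, 1]]
D1, D2, D3 = bootstrap_sets(D, 3)

# Each Di would be passed to its own model; the models' outputs
# (e.g. f1, f2, f3 = 1, 0, 1) are then combined:
print(aggregate([1, 0, 1]))                            # -> 1 (majority vote)
print(aggregate([2.4, 2.9, 2.6], task="regression"))   # -> mean of the outputs
```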
Bootstrapping produces datasets that are similar but not identical. Given $N$ training examples per dataset:
Probability that example 1 is not selected in a single draw = $1 - \frac{1}{N}$
Probability that example 1 is never selected in $N$ draws = $(1 - \frac{1}{N})^{N}$
As $N \to \infty$, $(1 - \frac{1}{N})^{N} \to \frac{1}{e}$
The value of $\frac{1}{e}$ is around 0.3679, so on average only about 63% ($1 - \frac{1}{e}$) of the original examples appear in each bootstrap set. This allows bagging to reduce variance: even if the individual models overfit, they overfit to different things.
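A quick numerical check of this limit (a standalone snippet, independent of the random forest code below):

```python
import math

# Probability that a given example is never drawn in N samples with
# replacement, compared against the limiting value 1/e ≈ 0.3679.
for N in (10, 100, 1000, 100000):
    p_missing = (1 - 1 / N) ** N
    print(N, round(p_missing, 4))

print("1/e =", round(1 / math.e, 4))
# Roughly 37% of the original examples are expected to be missing from each
# bootstrap set, so about 63% (1 - 1/e) of them appear at least once.
```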
Random forest is an extension of ensemble bagging that uses decision trees as the learning model. The trees have fixed structures and random features.
Random forest is based on two concepts: bagging, which has been discussed previously, and subspace sampling. Subspace sampling is necessary to further reduce variance, since the trees created after bagging alone are correlated: they each use the same greedy algorithm and the same set of features to select the best split point, so they are likely to select very similar split points and produce very similar trees. Subspace sampling prevents this from happening by changing the features considered at each branching point when generating a split. It is the process of randomly selecting (with or without replacement) some number $m < p$ of features to evaluate at each split, where $p$ is the total number of features.
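As a small illustration of subspace sampling at a single branching point, the sketch below picks $m$ of $p$ feature indices at random. The function name sample_feature_indices() is hypothetical; a full version of this idea appears inside the get_split() sketch later.

```python
import random

def sample_feature_indices(p, m, replace=False):
    """Randomly choose m of the p available features to evaluate at one split.
    Sampling can be done with or without replacement."""
    indices = list(range(p))
    if replace:
        return [random.choice(indices) for _ in range(m)]
    return random.sample(indices, m)

# With p = 9 features, a common choice for classification is m ≈ sqrt(p) = 3.
print(sample_feature_indices(p=9, m=3))
```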
The following code examples will help explain the steps involved in constructing a random forest that works on binary classification problems.
The subsample() function creates a bootstrap training set by randomly selecting examples from the given dataset with replacement. The training set has size = length of dataset × ratio; if ratio = 1, each training set is the same size as the dataset passed to subsample(). To create n training sets, the function must be called n times.
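A minimal sketch of what subsample() might look like, assuming the dataset is a list of rows; the implementation details are assumptions based on the description above.

```python
from random import randrange

def subsample(dataset, ratio=1.0):
    """Build one bootstrap training set by drawing rows from `dataset`
    with replacement until it reaches len(dataset) * ratio rows."""
    sample = []
    n_sample = round(len(dataset) * ratio)
    while len(sample) < n_sample:
        index = randrange(len(dataset))
        sample.append(dataset[index])
    return sample

# Creating n bootstrap training sets means calling subsample() n times:
# samples = [subsample(train, 1.0) for _ in range(n)]
```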
The get_split() function creates a list named features that stores the features which will be used to evaluate candidate splits. The features are selected randomly from those present in the original dataset. The features list is then used by test_split(), which divides the dataset based on a split point chosen from features, and by gini_index(), which calculates the Gini impurity of that split. In this example, a new list of features is not created at each branching point; instead, each tree in the ensemble receives a different set of features.
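Below is a sketch of get_split() together with the test_split() and gini_index() helpers it relies on. It assumes each row is a list whose last element is the class label and that features holds feature indices; these details are assumptions based on the description above, not necessarily the exact original code.

```python
from random import randrange

def test_split(index, value, dataset):
    """Split the dataset into two groups based on a feature index and value."""
    left = [row for row in dataset if row[index] < value]
    right = [row for row in dataset if row[index] >= value]
    return left, right

def gini_index(groups, classes):
    """Gini impurity of a candidate split, weighted by group size."""
    n_instances = float(sum(len(group) for group in groups))
    gini = 0.0
    for group in groups:
        size = float(len(group))
        if size == 0:
            continue
        score = 0.0
        for class_val in classes:
            proportion = [row[-1] for row in group].count(class_val) / size
            score += proportion * proportion
        gini += (1.0 - score) * (size / n_instances)
    return gini

def get_split(dataset, n_features):
    """Pick the best split point, considering only a random subset of features."""
    class_values = list(set(row[-1] for row in dataset))
    features = []
    while len(features) < n_features:          # random feature subset
        index = randrange(len(dataset[0]) - 1)
        if index not in features:
            features.append(index)
    b_index, b_value, b_score, b_groups = None, None, float("inf"), None
    for index in features:
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
    return {"index": b_index, "value": b_value, "groups": b_groups}
```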
to_terminal(), split() and build_tree() are functions that must be called to build a binary decision tree. predict() is the function that makes predictions using the decision tree.
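Continuing the sketch, plausible versions of to_terminal(), split(), build_tree() and predict() are shown below. They build on the get_split() sketch above, and the tree is represented as a nested dictionary, which is an assumption about the implementation rather than a confirmed detail.

```python
def to_terminal(group):
    """Return the most common class label in a group of rows."""
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)

def split(node, max_depth, min_size, n_features, depth):
    """Recursively split a node, or turn it into a terminal node."""
    left, right = node["groups"]
    del node["groups"]
    if not left or not right:                  # no effective split happened
        node["left"] = node["right"] = to_terminal(left + right)
        return
    if depth >= max_depth:                     # maximum depth reached
        node["left"], node["right"] = to_terminal(left), to_terminal(right)
        return
    for child, rows in (("left", left), ("right", right)):
        if len(rows) <= min_size:              # group too small to split further
            node[child] = to_terminal(rows)
        else:
            node[child] = get_split(rows, n_features)
            split(node[child], max_depth, min_size, n_features, depth + 1)

def build_tree(train, max_depth, min_size, n_features):
    """Build one decision tree from a (bootstrap) training set."""
    root = get_split(train, n_features)
    split(root, max_depth, min_size, n_features, 1)
    return root

def predict(node, row):
    """Walk the tree until a terminal value (class label) is reached."""
    branch = node["left"] if row[node["index"]] < node["value"] else node["right"]
    if isinstance(branch, dict):
        return predict(branch, row)
    return branch
```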
The function bagging_predict() accepts the results from all the trees constructed and takes a vote in order to give the final result. Since this is a binary classification problem, bagging_predict() counts the number of 1s and 0s returned by the trees and outputs the majority as the final result (e.g. five 1s and one 0 → output = 1).
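A sketch of bagging_predict(), reusing predict() from above. Majority voting is written here as a generic most-common-value count, which reduces to counting 1s and 0s in the binary case.

```python
def bagging_predict(trees, row):
    """Collect each tree's prediction for one row and return the majority vote."""
    predictions = [predict(tree, row) for tree in trees]
    return max(set(predictions), key=predictions.count)
```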
random_forest() executes the random forest algorithm. The training data and test data are passed to the algorithm, along with the maximum depth of each tree, the number of bootstrap training sets, and the size of each bootstrap training set.
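Finally, a sketch of random_forest() tying the pieces together, followed by a hypothetical usage example on a toy binary dataset. The data and parameter values are made up for illustration, and the function relies on the subsample(), build_tree() and bagging_predict() sketches above.

```python
from math import sqrt

def random_forest(train, test, max_depth, min_size, sample_size, n_trees, n_features):
    """Train n_trees trees on bootstrap samples of `train`, then predict each
    row of `test` by majority vote across the trees."""
    trees = []
    for _ in range(n_trees):
        sample = subsample(train, sample_size)          # bootstrap training set
        tree = build_tree(sample, max_depth, min_size, n_features)
        trees.append(tree)
    return [bagging_predict(trees, row) for row in test]

# Hypothetical usage on a toy dataset (last column is the binary class label).
train = [[2.7, 2.5, 0], [1.4, 2.3, 0], [3.3, 4.4, 0],
         [7.6, 2.7, 1], [8.6, -0.2, 1], [7.9, 3.5, 1]]
test = [[3.0, 3.0, None], [8.0, 3.0, None]]
n_features = int(sqrt(len(train[0]) - 1))               # m ≈ sqrt(p)
print(random_forest(train, test, max_depth=3, min_size=1,
                    sample_size=1.0, n_trees=5, n_features=n_features))
```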