Ensemble Method : Boosting (Classification)

- Simran Chourasia


Ensemble methods are algorithms that combine multiple trained models, or classes of models, into a single predictive model in order to decrease variance, decrease bias, or otherwise improve predictions.

Bias and Variance

Bias is the error that comes from overly simple assumptions in the model, while variance is the error that comes from sensitivity to small fluctuations in the training data. Bagging is an ensemble method that decreases the variance of the prediction: it generates additional training sets by sampling from the original dataset with repetition, producing multiple sets of the original data. Boosting is an iterative technique that adjusts the weight of each observation based on the previous round's classification and mainly reduces bias. In these lecture notes, we discuss the boosting method applied to a classification problem.


Boosting

Boosting refers to a family of algorithms that are able to convert weak learners into strong learners. The main principle of boosting is to fit a sequence of weak learners (models that are only slightly better than random guessing, such as small decision trees) to weighted versions of the data. More weight is given to examples that were misclassified in earlier rounds.

The predictions are then combined through a weighted majority vote (classification) or a weighted sum (regression) to produce the final prediction. The principal difference between boosting and the committee methods, such as bagging, is that base learners are trained in sequence on a weighted version of the data.
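As a minimal illustration of this idea (not part of the original notes), the sketch below trains scikit-learn's AdaBoostClassifier, a classic boosting algorithm, on a synthetic dataset: depth-1 decision trees are fit sequentially on reweighted data and combined by a weighted vote. The dataset and all parameter values are arbitrary choices for the example; note that the estimator argument is called base_estimator in scikit-learn releases before 1.2.

```python
# Illustrative boosting example: "stumps" (depth-1 trees) are trained
# sequentially on reweighted data and combined by a weighted majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

booster = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # the weak learner
    n_estimators=100,                               # number of boosting rounds
    learning_rate=0.5,                              # scales each learner's vote
    random_state=42,
)
booster.fit(X_train, y_train)
print("test accuracy:", booster.score(X_test, y_test))
```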


Gradient Boosting

Gradient Boosting has three main components:

- A loss function to be optimized, which must be differentiable (for classification, the log-loss).
- A weak learner to make predictions, typically a small decision tree.
- An additive model that adds weak learners sequentially, each new learner correcting the errors of the current ensemble so that the loss decreases.
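These three components map, at least roughly, onto the constructor arguments of common implementations. The snippet below (an illustration, not from the notes) shows the mapping for scikit-learn's GradientBoostingClassifier; the particular values are arbitrary, and the loss name "log_loss" applies to recent scikit-learn releases (older ones call the same loss "deviance").

```python
# Rough mapping of the three components onto scikit-learn's GradientBoostingClassifier.
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    loss="log_loss",    # 1. the differentiable loss function being optimized
    max_depth=3,        # 2. the weak learner: shallow regression trees
    n_estimators=100,   # 3. the additive model: trees are added one after another,
    learning_rate=0.1,  #    each scaled by the learning rate
)
```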


Gradient Boosting in Classification

Illustration through an example:

Let's look at a binary classification problem which aims at predicting the fate of the passengers on the Titanic based on a few features: their age, gender, etc. For convenience, we take only a subset of the dataset and choose certain columns. Our dataset looks something like this:

[Table: Titanic Passenger Data, showing six passengers with the columns Pclass, Age, Fare, Sex, and Survived]

where 'Pclass' (Passenger Class) is categorical: 1, 2, or 3;
'Age' is the age of the passenger when they were on the Titanic;
'Fare' is the passenger fare; 'Sex' is the gender of the person;
'Survived' (the label) indicates whether or not the person survived the crash: 0 if they did not, 1 if they did.

Now, we start with one leaf node that predicts the initial value for every individual passenger. For a classification problem, this initial value is the log(odds) of the target. The log(odds) plays the same role for classification that the average of the target plays for regression. Since four passengers in our case survived and two did not, the log(odds) that a passenger survived would be:

log(odds) = log(4/2) ≈ 0.7
This becomes our initial leaf.
[Figure: the initial leaf, containing the value log(odds) = 0.7]
The easiest way to use the log(odds) for classification is to convert it to a probability. To do so, we'll use this formula:
probability of surviving = e^(log(odds)) / (1 + e^(log(odds))) = e^0.7 / (1 + e^0.7) ≈ 0.7

Since the probability of surviving is greater than 0.5, we initially classify everyone in the training dataset as a survivor. (0.5 is a common threshold for classification decisions based on a probability; the threshold can easily be set to something else.)
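As a quick numerical check of these two steps (a minimal sketch; only the survivor counts come from the example):

```python
import numpy as np

survived, not_survived = 4, 2                             # counts from the example
log_odds = np.log(survived / not_survived)                # log(4/2) ≈ 0.69, treated as 0.7
probability = np.exp(log_odds) / (1 + np.exp(log_odds))   # ≈ 0.67, also rounded to 0.7

print(round(log_odds, 2), round(probability, 2))
print("initial class for every passenger:", int(probability > 0.5))  # 1, i.e. 'survived'
```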

Now we need to calculate the Pseudo Residual, i.e., the difference between the observed value and the predicted value. Let us draw the residuals on a graph.

[Figure: the observed values plotted against the predicted probability of 0.7]

The blue and the yellow dots are the observed values. The blue dots are the passengers who did not survive, with a probability of 0, and the yellow dots are the passengers who survived, with a probability of 1. The dotted line represents the predicted probability, which is 0.7.

We need to find the residuals, which would be:

Residual = (Observed value) - (Predicted probability)

For the passengers who survived the residual is 1 - 0.7 = 0.3, and for the passengers who did not survive it is 0 - 0.7 = -0.7.

Here, 1 denotes Yes and 0 denotes No.

We will use this residual to get the next tree.
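In code, the pseudo-residuals are just the element-wise difference between the observed labels and the current predicted probabilities. The ordering of the labels below is an assumption; the notes only fix the counts (four survivors, two non-survivors).

```python
import numpy as np

observed = np.array([0, 1, 1, 1, 0, 1], dtype=float)  # assumed ordering of the six passengers
predicted = np.full(observed.shape, 0.7)              # initial probability for every passenger

residuals = observed - predicted                      # 0.3 for survivors, -0.7 otherwise
print(residuals)
```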

[Figure: branching out the data points using the residual values]

We limit the tree to just a few leaves here to simplify our example, but in practice Gradient Boosting typically grows trees with anywhere from 8 to 32 leaves.

Because of the limit on leaves, one leaf can contain multiple residual values. The predictions are in terms of log(odds), but these leaves are derived from probabilities, which causes a mismatch in scale. So we cannot simply add the single leaf we got earlier to this new tree to get the new predictions, because the two come from different quantities. We have to apply some kind of transformation. The most common transformation used in Gradient Boosting for classification is:

Output value for a leaf = Σ residual_i / Σ [ previous probability_i * (1 - previous probability_i) ]

The numerator in this equation is the sum of the residuals in that particular leaf.

The denominator is the sum of (previous prediction probability for each residual) * (1 - that same previous prediction probability).
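As a small sketch, this transformation can be written as a helper that takes the residuals in one leaf together with the previous predicted probabilities of the samples in that leaf (the function name leaf_output is ours, not from the notes):

```python
import numpy as np

def leaf_output(residuals, previous_probs):
    # Output value = sum of residuals / sum of previous_prob * (1 - previous_prob).
    residuals = np.asarray(residuals, dtype=float)
    previous_probs = np.asarray(previous_probs, dtype=float)
    return residuals.sum() / (previous_probs * (1 - previous_probs)).sum()

# First leaf of the example: a single residual of 0.3, previous probability 0.7.
print(leaf_output([0.3], [0.7]))   # 0.3 / (0.7 * 0.3) ≈ 1.43
```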

The first leaf has only one residual value, 0.3, and since this is the first tree, the previous probability is the value from the initial leaf and is therefore the same for all residuals. Hence,

Output value = 0.3 / (0.7 * (1 - 0.7)) = 0.3 / 0.21 ≈ 1.43

For the second leaf,

[Output value for the second leaf, computed with the same formula from that leaf's residuals and the previous probability of 0.7]

Similarly, for the last leaf:

[Output value for the last leaf, computed in the same way]

Now the transformed tree looks like:

[Figure: the new tree with the transformed output value in each leaf]

Now that we have transformed it, we can add our initial leaf to our new tree, scaled by a learning rate.

New prediction = initial leaf + (Learning Rate * output value of the new tree)

The Learning Rate is used to scale the contribution from the new tree. This results in a small step in the right direction of the prediction. Empirical evidence shows that taking lots of small steps in the right direction gives better predictions on a testing dataset (i.e., data the model has never seen) than aiming for a perfect prediction in the first step. The Learning Rate is usually a small number such as 0.1.

We can now calculate the new log(odds) prediction and hence a new probability.

For example, for the first passenger, Old Tree = 0.7. The Learning Rate, which remains the same for all records, is equal to 0.1, and by scaling the new tree we find its value to be -0.16. Hence, substituting into the formula we get:

New prediction = 0.7 + (-0.16) = 0.54

Similarly, we substitute and find the new log(odds) for each passenger and hence find the probability. Using the new probability, we will calculate the new residuals.
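Sketching this update for the first passenger with the numbers quoted above (the helper name prob_from_log_odds is ours):

```python
import numpy as np

def prob_from_log_odds(log_odds):
    return np.exp(log_odds) / (1 + np.exp(log_odds))

old_prediction = 0.7
scaled_tree_output = -0.16                            # learning rate already applied, as in the text

new_log_odds = old_prediction + scaled_tree_output    # 0.54
new_probability = prob_from_log_odds(new_log_odds)    # ≈ 0.63
new_residual = 1 - new_probability                    # next pseudo-residual if this passenger survived
print(round(new_log_odds, 2), round(new_probability, 2), round(new_residual, 2))
```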

This process repeats until we have made the maximum number of trees specified or the residuals get very small.

The algorithm:

Let us understand the process mathematically. We are given data

{(xi, yi)} for i = 1, ..., n
where xi = the input variables that we feed into our model
yi = the target variable that we are trying to predict

We can write the log(likelihood) of the data given the predicted probabilities as:

log(likelihood) = Σ [ yi * log(p) + (1 - yi) * log(1 - p) ]

where yi is the observed value (0 or 1) and p is the predicted probability.

The goal is to maximize the log(likelihood). Hence, if we want to use the log(likelihood) as a loss function, where smaller values represent better-fitting models, we multiply it by -1. For a single observation:

Loss = -[ y * log(p) + (1 - y) * log(1 - p) ]
Now this loss is a function of the predicted probability p, but we need it to be a function of the predicted log(odds). So let us convert the formula:

Loss = -y * log(p) - (1 - y) * log(1 - p)
     = -y * [ log(p) - log(1 - p) ] - log(1 - p)
     = -y * log(p / (1 - p)) - log(1 - p)
     = -y * log(odds) - log(1 - p)

We know that:

p = e^(log(odds)) / (1 + e^(log(odds)))

Substituting,

1 - p = 1 / (1 + e^(log(odds)))

Now,

log(1 - p) = -log(1 + e^(log(odds)))

Hence,

Loss = -y * log(odds) + log(1 + e^(log(odds)))

Now that we have written it in terms of log(odds) rather than p, this becomes our Loss Function.

We have to show that this is differentiable. Taking the derivative with respect to log(odds):

d(Loss) / d(log(odds)) = -y + e^(log(odds)) / (1 + e^(log(odds)))

This can also be written as:

d(Loss) / d(log(odds)) = -y + p

which is simply the negative of the residual (observed minus predicted).
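A quick numerical sanity check of this derivative, using a finite-difference approximation (a sketch; the values of y and log(odds) are arbitrary):

```python
import numpy as np

def loss(y, log_odds):
    # Loss in terms of log(odds): -y * log(odds) + log(1 + e^(log(odds)))
    return -y * log_odds + np.log(1 + np.exp(log_odds))

y, log_odds, eps = 1.0, 0.7, 1e-6
p = np.exp(log_odds) / (1 + np.exp(log_odds))

numerical = (loss(y, log_odds + eps) - loss(y, log_odds - eps)) / (2 * eps)
analytical = -y + p                                # the formula derived above
print(round(numerical, 4), round(analytical, 4))   # both ≈ -0.3318
```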
Now the actual steps of the model building are as follows.

Step 1: Initialize the model with a constant value, the log(odds) of the target (this is the initial leaf).

Step 2: For m = 1 to M (the number of trees):
(a) Compute the pseudo-residuals, observed value minus predicted probability, for every sample; this is the negative gradient of the loss with respect to the current log(odds) prediction.
(b) Fit a small regression tree to the pseudo-residuals.
(c) For each leaf of the tree, compute the transformed output value: sum of residuals / sum of previous probability * (1 - previous probability).
(d) Update the model by adding the new tree, scaled by the learning rate, to the previous prediction.

Step 3: After the last tree, convert the final log(odds) prediction to a probability and classify with a threshold (for example 0.5).
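Below is a compact, from-scratch sketch of these steps for binary classification. It is illustrative only: the function names (fit_gradient_boosting, predict_proba, sigmoid), the use of scikit-learn's DecisionTreeRegressor as the weak learner, and the default parameter values are choices made here, not something prescribed by the notes.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(log_odds):
    # Convert log(odds) to a probability.
    return 1.0 / (1.0 + np.exp(-log_odds))

def fit_gradient_boosting(X, y, n_trees=100, learning_rate=0.1, max_leaves=8):
    y = np.asarray(y, dtype=float)
    # Step 1: initialise every prediction with the log(odds) of the positive class.
    f0 = np.log(y.mean() / (1 - y.mean()))
    log_odds = np.full(y.shape, f0)
    trees, leaf_values = [], []

    for _ in range(n_trees):                   # Step 2: add trees one at a time
        p = sigmoid(log_odds)
        residuals = y - p                      # (a) pseudo-residuals
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves)
        tree.fit(X, residuals)                 # (b) fit a small tree to the residuals
        leaves = tree.apply(X)                 # leaf index of every training sample

        # (c) transformed output value of each leaf: sum(residuals) / sum(p * (1 - p))
        gamma = {}
        for leaf in np.unique(leaves):
            in_leaf = leaves == leaf
            gamma[leaf] = residuals[in_leaf].sum() / (p[in_leaf] * (1 - p[in_leaf])).sum()

        # (d) take a small, scaled step in the direction of the new tree
        log_odds = log_odds + learning_rate * np.array([gamma[leaf] for leaf in leaves])
        trees.append(tree)
        leaf_values.append(gamma)

    return f0, trees, leaf_values, learning_rate

def predict_proba(X, model):
    # Step 3: combine the initial leaf with every scaled tree, then convert to a probability.
    f0, trees, leaf_values, learning_rate = model
    log_odds = np.full(X.shape[0], f0)
    for tree, gamma in zip(trees, leaf_values):
        leaves = tree.apply(X)
        log_odds = log_odds + learning_rate * np.array([gamma[leaf] for leaf in leaves])
    return sigmoid(log_odds)
```

Given a NumPy feature matrix X_train and a 0/1 label vector y_train, usage would look like predict_proba(X_test, fit_gradient_boosting(X_train, y_train)).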

Advantages of Gradient Boosting

- It often gives predictive accuracy that is difficult to beat on structured (tabular) data.
- It is flexible: any differentiable loss function can be optimized, and the number of trees, learning rate, and tree size give plenty of room for tuning.
- It requires relatively little pre-processing of the data and works with both numerical and categorical features.
- Many implementations handle missing values without explicit imputation.

