- Simran Chourasia
Ensemble methods are algorithms that combine several trained models, or classes of models, into a single predictive model in order to decrease variance and bias, or to improve predictions.
Bias: The bias is an error from erroneous assumptions in the learning algorithm. It simply measures how far our estimated values are from the actual values. In the figure below, let's say our target is the central red circle. If our predictions (blue dots) are close to the original target, then we say we have a low bias. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
Variance: The variance is an error from sensitivity to small fluctuations in the training set. It is a measure of spread or variations in our predictions. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).
Bagging is an ensemble method that decreases the variance of the prediction by generating additional training data from the dataset, using combinations with repetitions to produce multiple sets of the original data. Boosting is an iterative technique which adjusts the weight of an observation based on the last classification. In these lecture notes, we will discuss the boosting method applied to a classification problem.
Boosting refers to a family of algorithms that are able to convert weak learners to strong learners. The main principle of boosting is to fit a sequence of weak learners (models that are only slightly better than random guessing, such as small decision trees) to weighted versions of the data. More weight is given to examples that were misclassified by earlier rounds.
The predictions are then combined through a weighted majority vote (classification) or a weighted sum (regression) to produce the final prediction. The principal difference between boosting and the committee methods, such as bagging, is that base learners are trained in sequence on a weighted version of the data.
Gradient Boosting has three main components: a loss function to be optimized, a weak learner (typically a small decision tree) that makes the predictions, and an additive model that adds weak learners sequentially so as to minimize the loss function.
Let's look at a binary classification problem which aims at predicting the fate of the passengers on the Titanic based on a few features: their age, gender, etc. We will take only a subset of the dataset and choose certain columns, for convenience. Our dataset looks something like this:
where 'Pclass' (Passenger Class) is categorical: 1, 2, or 3;
'Age' is the age of the passenger when they were on the Titanic;
'Fare' is the Passenger Fare; 'Sex' is the gender of the person;
'Survived' (label) refers to whether or not the person survived the sinking; 0 if they did not, 1 if they did.
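As an illustration of what this subset might look like in code, here is a minimal sketch in Python. The actual file used in these notes is not shown, so the sample "titanic" dataset bundled with seaborn is used as a stand-in, and the renaming only mirrors the column names described above.

import seaborn as sns  # seaborn ships a sample "titanic" dataset

# Load the sample data and keep only the columns described above.
df = sns.load_dataset("titanic")[["pclass", "age", "fare", "sex", "survived"]]
df = df.dropna().rename(columns={"pclass": "Pclass", "age": "Age",
                                 "fare": "Fare", "sex": "Sex",
                                 "survived": "Survived"})
print(df.head(6))  # a small subset, similar to the six-passenger example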
Now, we start with one leaf node that predicts the initial value for every individual passenger. For a classification problem, it will be the log(odds) of the target value; log(odds) is the equivalent of the average in a classification problem. Since four passengers in our case survived and two did not, the log(odds) that a passenger survived would be:
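Written out with the numbers above (4 survivors, 2 non-survivors), the initial leaf value and the corresponding probability would be

\log(\text{odds}) = \log\left(\frac{4}{2}\right) \approx 0.69,
\qquad
p = \frac{e^{0.69}}{1 + e^{0.69}} \approx 0.67,

which is roughly the 0.7 used in the rest of the example.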
Since this probability of surviving is greater than 0.5, we initially classify everyone in the training dataset as survivors. (0.5 is a common threshold for classification decisions based on a probability; the threshold can easily be set to something else.)
Now we need to calculate the Pseudo Residual, i.e., the difference between the observed value and the predicted value. Let us draw the residuals on a graph.
The blue and the yellow dots are the observed values. The blue dots are the passengers who did not survive, with a probability of 0, and the yellow dots are the passengers who survived, with a probability of 1. The dotted line here represents the predicted probability, which is 0.7.
We need to find the residual, which would be:
Here, 1 denotes Yes and 0 denotes No.
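Concretely, the residual is the observed value minus the predicted probability, so with the initial prediction of roughly 0.7:

\text{Residual} = \text{Observed} - \text{Predicted}:
\quad 1 - 0.7 = 0.3 \ \text{(survived)},
\qquad 0 - 0.7 = -0.7 \ \text{(did not survive)}.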
We will use this residual to get the next tree.
We use a limit of two leaves here to simplify our example, but in practice, Gradient Boost typically uses between 8 and 32 leaves.
Because of the limit on leaves, one leaf can have multiple values. Predictions are in terms of log(odds), but these leaves are derived from probabilities, which causes a disparity. So we can't just add the single leaf we got earlier and this tree to get new predictions, because they are derived from different sources. We have to use some kind of transformation. The most common form of transformation used in Gradient Boost for classification is:
The numerator in this equation is the sum of the residuals in that particular leaf.
The denominator is the sum of (previous predicted probability for each residual) * (1 - the same previous predicted probability).
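Putting those two sentences into a formula, the transformation for a leaf is

\gamma = \frac{\sum_i \text{Residual}_i}{\sum_i p_i\,(1 - p_i)},

where the sums run over the residuals in that leaf and p_i is the previous predicted probability associated with each residual.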
The first leaf has only one residual value, 0.3, and since this is the first tree, the previous probability is the value from the initial leaf and is thus the same for all residuals. Hence,
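a worked version of this calculation, using the rounded previous probability of 0.7, would be

\gamma = \frac{0.3}{0.7 \times (1 - 0.7)} = \frac{0.3}{0.21} \approx 1.43.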
For the second leaf,
Similarly, for the last leaf:
Now the transformed tree looks like:
Now that we have transformed it, we can add our initial leaf to our new tree, scaled by a learning rate.
The learning rate is used to scale the contribution from the new tree. This results in a small step in the right direction of prediction. Empirical evidence shows that taking lots of small steps in the right direction results in better predictions on a testing dataset (i.e., data the model has never seen) than trying to make a perfect prediction in the first step. The learning rate is usually a small number like 0.1.
We can now calculate new log(odds) prediction and hence a new probability.
For example, for the first passenger, Old Tree = 0.7. The learning rate, which remains the same for all records, is equal to 0.1, and by scaling the new tree, we find its value to be -0.16. Hence, substituting in the formula we get:
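Assuming the update adds the scaled tree output to the previous log(odds), and that -0.16 is already the scaled contribution of the new tree, the substitution would give

\text{new } \log(\text{odds}) = 0.7 + (-0.16) = 0.54,
\qquad
p = \frac{e^{0.54}}{1 + e^{0.54}} \approx 0.63.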
Similarly, we substitute and find the new log(odds) for each passenger and hence find the probability. Using the new probability, we will calculate the new residuals.
This process repeats until we have made the maximum number of trees specified or the residuals get very small.
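This iterative procedure (an initial log(odds) leaf, pseudo-residuals, small regression trees, and learning-rate-scaled updates) is what scikit-learn's GradientBoostingClassifier implements, so the whole example can be sketched in a few lines of Python. The hyperparameter values below are illustrative rather than taken from these notes, and the feature encoding reuses the df loaded in the earlier sketch.

from sklearn.ensemble import GradientBoostingClassifier

# Numeric features plus a simple 0/1 encoding of 'Sex'.
X = df[["Pclass", "Age", "Fare"]].assign(Sex=(df["Sex"] == "male").astype(int))
y = df["Survived"]

model = GradientBoostingClassifier(
    n_estimators=100,    # M, the maximum number of trees to build
    learning_rate=0.1,   # nu, scales each tree's contribution
    max_leaf_nodes=8,    # keep each tree small (8 to 32 leaves is typical)
)
model.fit(X, y)
print(model.predict_proba(X)[:6, 1])  # predicted survival probabilities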
We can compute the log(likelihood) of the data given the predicted probabilities.
The goal would be to maximize the log(likelihood) function. Hence, if we want to use the log(likelihood) as our loss function, where smaller values represent better-fitting models, we multiply it by -1:
Now that we have converted p to log(odds), this becomes our Loss Function.
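Writing this out (the standard conversion), the negative log(likelihood) for a single observation is

-\big[y \log(p) + (1 - y)\log(1 - p)\big]
= -y \cdot \log(\text{odds}) + \log\big(1 + e^{\log(\text{odds})}\big),

using p = e^{log(odds)} / (1 + e^{log(odds)}); the expression on the right is the Loss Function used below.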
We have to show that this is differentiable.
This can also be written as :
Here, yi is the observed value, L is the loss function, and gamma is the value of log(odds).
We are summing the loss function, i.e., we add up the Loss Function for each observed value.
argmin over gamma means that we need to find a log(odds) value that minimizes this sum.
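In the usual notation, the initialization step described in the last few lines is

F_0(x) = \underset{\gamma}{\operatorname{argmin}} \sum_{i=1}^{n} L(y_i, \gamma).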
Then, we take the derivative of each loss function :
... and so on.
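For each observation, with the Loss Function written in terms of gamma = log(odds), the derivative works out to

\frac{d}{d\gamma}\Big[-y_i\,\gamma + \log\big(1 + e^{\gamma}\big)\Big]
= -y_i + \frac{e^{\gamma}}{1 + e^{\gamma}} = -y_i + p,

and setting the sum of these derivatives to zero recovers the initial log(odds) computed from the data.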
This step requires us to calculate the residuals using the given formula. We have already found the Loss Function to be:
Hence,
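using the derivative of the Loss Function above, the pseudo-residual in the usual notation is

r_{i,m} = -\left[\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F = F_{m-1}} = y_i - p_i,

i.e., the observed value minus the most recently predicted probability, exactly the residual computed in the example above.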
Fit a regression tree to the residual values and create terminal regions
Because the number of leaves is limited, we might have more than one value in a particular terminal region.
In our first tree, m = 1 and j will be the unique number for each terminal node: R11, R21, and so on.
For each leaf in the new tree, we calculate gamma, which is the output value. The summation should be only over those records which go into making that leaf. In theory, we could find the derivative with respect to gamma directly to obtain the value of gamma, but that would be extremely tedious because of the terms involved in our loss function.
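In the standard notation, the output value being described for leaf j of tree m is

\gamma_{j,m} = \underset{\gamma}{\operatorname{argmin}} \sum_{x_i \in R_{j,m}} L\big(y_i,\; F_{m-1}(x_i) + \gamma\big).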
Substituting the loss function and i=1 in the equation above, we get:
We use second order Taylor Polynomial to approximate this Loss Function :
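The approximation expands the loss around the previous prediction:

L\big(y, F_{m-1}(x) + \gamma\big) \approx
L\big(y, F_{m-1}(x)\big)
+ \gamma\, L'\big(y, F_{m-1}(x)\big)
+ \tfrac{1}{2}\,\gamma^{2}\, L''\big(y, F_{m-1}(x)\big).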
There are three terms in our approximation. Taking the derivative with respect to gamma gives us:
Equating this to 0 and subtracting the first-derivative term from both sides:
Then, gamma will be equal to :
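Carrying out these three steps on the approximation gives

\frac{d}{d\gamma} L \approx L'\big(y, F_{m-1}(x)\big) + \gamma\, L''\big(y, F_{m-1}(x)\big) = 0
\quad\Longrightarrow\quad
\gamma = \frac{-\,L'\big(y, F_{m-1}(x)\big)}{L''\big(y, F_{m-1}(x)\big)}.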
The gamma equation may look humongous, but in simple terms, it is:
We will just substitute the value of the derivative of the Loss Function.
Now we shall solve for the second derivative of the Loss Function. After some heavy computation, we get:
We have simplified the numerator as well as the denominator. The final gamma solution looks like :
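Filling in the pieces referred to above, with p denoting the most recently predicted probability, the numerator and denominator simplify to

-\,L'\big(y, F_{m-1}(x)\big) = y - p = \text{Residual},
\qquad
L''\big(y, F_{m-1}(x)\big) = p\,(1 - p),

so the gamma for a single residual is

\gamma = \frac{y - p}{p\,(1 - p)} = \frac{\text{Residual}}{p\,(1 - p)}.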
We were trying to find the value of gamma that, when added to the most recent predicted log(odds), minimizes our Loss Function. This gamma works when our terminal region has only one residual value and hence one predicted probability. But recall from our example above that, because of the restricted leaves in Gradient Boosting, it is possible for one terminal region to have many values. Then the generalized formula would be:
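In the usual notation, the generalized output value for a leaf is

\gamma_{j,m} = \frac{\sum_{x_i \in R_{j,m}} \big(y_i - p_i\big)}{\sum_{x_i \in R_{j,m}} p_i\,(1 - p_i)}
= \frac{\sum \text{Residual}_i}{\sum p_i\,(1 - p_i)},

which is exactly the transformation applied to the leaves of the example tree earlier.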
Hence, we have calculated the output values for each leaf in the tree.
This formula is asking us to update our predictions. In the first pass, m = 1, and we substitute F0(x), the common prediction for all samples (i.e., the initial leaf value), plus nu, the learning rate, times the output value from the tree we built previously. The summation is for the case where a single sample ends up in multiple leaves.
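In the standard notation, the update being described is

F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{j,m}\, I\big(x \in R_{j,m}\big),

where nu is the learning rate and the indicator I picks out the leaf (or leaves) that the sample x falls into.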
Now we will use this new F1(x) value to get new predictions for each sample.
The new predicted value should get us a little closer to the actual value. Note that, in contrast to the single tree in our example, gradient boosting builds many trees, and M could be 100 or more.
This completes our for loop in Step 2 and we are ready for the final step of Gradient Boosting.
If we get new data, then we shall use this value to predict whether the passenger survived or not. This would give us the log(odds) that the person survived. Plugging it into the 'p' formula:
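The conversion back to a probability is the same one used throughout:

p = \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}}.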
If the resultant value lies above our threshold, then the person survived; otherwise, they did not.