- Simran Chourasia
Ensemble methods are algorithms that combine several trained models, or classes of models, into a single predictive model in order to decrease variance and bias, or to improve predictions.
Bias: The bias is an error from erroneous assumptions in the learning algorithm. It simply measures how far our estimated values are from the actual values. In the figure below, let's say our target is the central red circle. If our predictions (blue dots) are close to the original target, then we say we have a low bias. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
Variance: The variance is an error from sensitivity to small fluctuations in the training set. It is a measure of spread or variations in our predictions. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).
Bagging is an ensemble method that decreases the variance of the prediction by generating additional training data from the dataset, using combinations with repetitions to produce multiple sets of the original data. Boosting is an iterative technique which adjusts the weight of an observation based on the last classification. In these lecture notes, we will discuss the boosting method applied to a classification problem.
Boosting refers to a family of algorithms that are able to convert weak learners to strong learners. The main principle of boosting is to fit a sequence of weak learners (models that are only slightly better than random guessing, such as small decision trees) to weighted versions of the data. More weight is given to examples that were misclassified by earlier rounds.
The predictions are then combined through a weighted majority vote (classification) or a weighted sum (regression) to produce the final prediction. The principal difference between boosting and the committee methods, such as bagging, is that base learners are trained in sequence on a weighted version of the data.
Gradient Boosting has three main components: a loss function to be optimized, a weak learner (typically a small decision tree) that makes the predictions, and an additive model that adds weak learners sequentially so as to minimize the loss function.
Let's look at a binary classification problem which aims at predicting the fate of the passengers on the Titanic based on a few features: their age, gender, etc. We will take only a subset of the dataset and choose certain columns, for convenience. Our dataset looks something like this:
where 'Pclass' (Passenger Class) is categorical: 1, 2, or 3;
'Age' is the age of the passenger when they were on the Titanic;
'Fare' is the Passenger Fare; 'Sex' is the gender of the person;
'Survived' (label) refers to whether or not the person survived the sinking; 0 if they did not, 1 if they did.
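As an illustration of what this subset might look like in code, here is a minimal sketch in Python. The actual file used in these notes is not shown, so the sample "titanic" dataset bundled with seaborn is used as a stand-in, and the renaming only mirrors the column names described above.

import seaborn as sns  # seaborn ships a sample "titanic" dataset

# Load the sample data and keep only the columns described above.
df = sns.load_dataset("titanic")[["pclass", "age", "fare", "sex", "survived"]]
df = df.dropna().rename(columns={"pclass": "Pclass", "age": "Age",
                                 "fare": "Fare", "sex": "Sex",
                                 "survived": "Survived"})
print(df.head(6))  # a small subset, similar to the six-passenger example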
Now, we start with one leaf node that predicts the initial value for every individual passenger. For a classification problem, it will be the log(odds) of the target value; log(odds) is the equivalent of the average in a classification problem. Since four passengers in our case survived and two did not, the log(odds) that a passenger survived would be:
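Written out with the numbers above (4 survivors, 2 non-survivors), the initial leaf value and the corresponding probability would be

\log(\text{odds}) = \log\left(\frac{4}{2}\right) \approx 0.69,
\qquad
p = \frac{e^{0.69}}{1 + e^{0.69}} \approx 0.67,

which is roughly the 0.7 used in the rest of the example.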
Since this probability of surviving is greater than 0.5, we initially classify everyone in the training dataset as survivors. (0.5 is a common threshold for classification decisions based on a probability; the threshold can easily be set to something else.)
Now we need to calculate the Pseudo Residual, i.e., the difference between the observed value and the predicted value. Let us draw the residuals on a graph.
The blue and the yellow dots are the observed values. The blue dots are the passengers who did not survive, with a probability of 0, and the yellow dots are the passengers who survived, with a probability of 1. The dotted line here represents the predicted probability, which is 0.7.
We need to find the residual, which would be:
Here, 1 denotes Yes and 0 denotes No.
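Concretely, the residual is the observed value minus the predicted probability, so with the initial prediction of roughly 0.7:

\text{Residual} = \text{Observed} - \text{Predicted}:
\quad 1 - 0.7 = 0.3 \ \text{(survived)},
\qquad 0 - 0.7 = -0.7 \ \text{(did not survive)}.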
We will use this residual to get the next tree.
We use a limit of two leaves here to simplify our example, but in practice, Gradient Boost typically uses between 8 and 32 leaves.
Because of the limit on leaves, one leaf can have multiple values. Predictions are in terms of log(odds), but these leaves are derived from probabilities, which causes a disparity. So we can't just add the single leaf we got earlier and this tree to get new predictions, because they are derived from different sources. We have to use some kind of transformation. The most common form of transformation used in Gradient Boost for classification is:
The numerator in this equation is the sum of the residuals in that particular leaf.
The denominator is the sum of (previous predicted probability for each residual) * (1 - the same previous predicted probability).
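Putting those two sentences into a formula, the transformation for a leaf is

\gamma = \frac{\sum_i \text{Residual}_i}{\sum_i p_i\,(1 - p_i)},

where the sums run over the residuals in that leaf and p_i is the previous predicted probability associated with each residual.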
The first leaf has only one residual value, 0.3, and since this is the first tree, the previous probability is the value from the initial leaf and is thus the same for all residuals. Hence,
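a worked version of this calculation, using the rounded previous probability of 0.7, would be

\gamma = \frac{0.3}{0.7 \times (1 - 0.7)} = \frac{0.3}{0.21} \approx 1.43.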
For the second leaf,
Similarly, for the last leaf:
Now the transformed tree looks like:
Now that we have transformed it, we can add our initial leaf to our new tree, scaled by a learning rate.
The learning rate is used to scale the contribution from the new tree. This results in a small step in the right direction of prediction. Empirical evidence shows that taking lots of small steps in the right direction results in better predictions on a testing dataset (i.e., data the model has never seen) than trying to make a perfect prediction in the first step. The learning rate is usually a small number like 0.1.
We can now calculate new log(odds) prediction and hence a new probability.
For example, for the first passenger, Old Tree = 0.7. The learning rate, which remains the same for all records, is equal to 0.1, and by scaling the new tree, we find its value to be -0.16. Hence, substituting in the formula we get:
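Assuming the update adds the scaled tree output to the previous log(odds), and that -0.16 is already the scaled contribution of the new tree, the substitution would give

\text{new } \log(\text{odds}) = 0.7 + (-0.16) = 0.54,
\qquad
p = \frac{e^{0.54}}{1 + e^{0.54}} \approx 0.63.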
Similarly, we substitute and find the new log(odds) for each passenger and hence find the probability. Using the new probability, we will calculate the new residuals.
This process repeats until we have made the maximum number of trees specified or the residuals get very small.
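This iterative procedure (an initial log(odds) leaf, pseudo-residuals, small regression trees, and learning-rate-scaled updates) is what scikit-learn's GradientBoostingClassifier implements, so the whole example can be sketched in a few lines of Python. The hyperparameter values below are illustrative rather than taken from these notes, and the feature encoding reuses the df loaded in the earlier sketch.

from sklearn.ensemble import GradientBoostingClassifier

# Numeric features plus a simple 0/1 encoding of 'Sex'.
X = df[["Pclass", "Age", "Fare"]].assign(Sex=(df["Sex"] == "male").astype(int))
y = df["Survived"]

model = GradientBoostingClassifier(
    n_estimators=100,    # M, the maximum number of trees to build
    learning_rate=0.1,   # nu, scales each tree's contribution
    max_leaf_nodes=8,    # keep each tree small (8 to 32 leaves is typical)
)
model.fit(X, y)
print(model.predict_proba(X)[:6, 1])  # predicted survival probabilities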
We can compute the log(likelihood) of the data given the predicted probabilities.
The goal would be to maximize the log(likelihood) function. Hence, if we want to use the log(likelihood) as our loss function, where smaller values represent better-fitting models, we multiply it by -1:
Now that we have converted p to log(odds), this becomes our Loss Function.
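Writing this out (the standard conversion), the negative log(likelihood) for a single observation is

-\big[y \log(p) + (1 - y)\log(1 - p)\big]
= -y \cdot \log(\text{odds}) + \log\big(1 + e^{\log(\text{odds})}\big),

using p = e^{log(odds)} / (1 + e^{log(odds)}); the expression on the right is the Loss Function used below.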
We have to show that this is differentiable.
This can also be written as :
Here, yi is the observed value, L is the loss function, and gamma is the value of log(odds).
We are summing the loss function, i.e., we add up the Loss Function for each observed value.
argmin over gamma means that we need to find a log(odds) value that minimizes this sum.
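In the usual notation, the initialization step described in the last few lines is

F_0(x) = \underset{\gamma}{\operatorname{argmin}} \sum_{i=1}^{n} L(y_i, \gamma).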
Then, we take the derivative of each loss function :
... and so on.
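For each observation, with the Loss Function written in terms of gamma = log(odds), the derivative works out to

\frac{d}{d\gamma}\Big[-y_i\,\gamma + \log\big(1 + e^{\gamma}\big)\Big]
= -y_i + \frac{e^{\gamma}}{1 + e^{\gamma}} = -y_i + p,

and setting the sum of these derivatives to zero recovers the initial log(odds) computed from the data.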
This step requires us to calculate the residuals using the given formula. We have already found the Loss Function to be:
Hence,
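using the derivative of the Loss Function above, the pseudo-residual in the usual notation is

r_{i,m} = -\left[\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F = F_{m-1}} = y_i - p_i,

i.e., the observed value minus the most recently predicted probability, exactly the residual computed in the example above.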
Fit a regression tree to the residual values and create terminal regions
Because the number of leaves is limited, we might have more than one value in a particular terminal region.
In our first tree, m = 1 and j will be the unique number for each terminal node: R11, R21, and so on.
For each leaf in the new tree, we calculate gamma, which is the output value. The summation should be only over those records which go into making that leaf. In theory, we could find the derivative with respect to gamma directly to obtain the value of gamma, but that would be extremely tedious because of the terms involved in our loss function.
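In the standard notation, the output value being described for leaf j of tree m is

\gamma_{j,m} = \underset{\gamma}{\operatorname{argmin}} \sum_{x_i \in R_{j,m}} L\big(y_i,\; F_{m-1}(x_i) + \gamma\big).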
Substituting the loss function and i=1 in the equation above, we get:
We use second order Taylor Polynomial to approximate this Loss Function :
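The approximation expands the loss around the previous prediction:

L\big(y, F_{m-1}(x) + \gamma\big) \approx
L\big(y, F_{m-1}(x)\big)
+ \gamma\, L'\big(y, F_{m-1}(x)\big)
+ \tfrac{1}{2}\,\gamma^{2}\, L''\big(y, F_{m-1}(x)\big).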
There are three terms in our approximation. Taking the derivative with respect to gamma gives us:
Equating this to 0 and subtracting the first-derivative term from both sides:
Then, gamma will be equal to :
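Carrying out these three steps on the approximation gives

\frac{d}{d\gamma} L \approx L'\big(y, F_{m-1}(x)\big) + \gamma\, L''\big(y, F_{m-1}(x)\big) = 0
\quad\Longrightarrow\quad
\gamma = \frac{-\,L'\big(y, F_{m-1}(x)\big)}{L''\big(y, F_{m-1}(x)\big)}.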
The gamma equation may look humongous, but in simple terms, it is:
We will just substitute the value of the derivative of the Loss Function.
Now we shall solve for the second derivative of the Loss Function. After some heavy computation, we get:
We have simplified the numerator as well as the denominator. The final gamma solution looks like :
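Filling in the pieces referred to above, with p denoting the most recently predicted probability, the numerator and denominator simplify to

-\,L'\big(y, F_{m-1}(x)\big) = y - p = \text{Residual},
\qquad
L''\big(y, F_{m-1}(x)\big) = p\,(1 - p),

so the gamma for a single residual is

\gamma = \frac{y - p}{p\,(1 - p)} = \frac{\text{Residual}}{p\,(1 - p)}.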
We were trying to find the value of gamma that, when added to the most recent predicted log(odds), minimizes our Loss Function. This gamma works when our terminal region has only one residual value and hence one predicted probability. But recall from our example above that, because of the restricted leaves in Gradient Boosting, it is possible for one terminal region to have many values. Then the generalized formula would be:
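In the usual notation, the generalized output value for a leaf is

\gamma_{j,m} = \frac{\sum_{x_i \in R_{j,m}} \big(y_i - p_i\big)}{\sum_{x_i \in R_{j,m}} p_i\,(1 - p_i)}
= \frac{\sum \text{Residual}_i}{\sum p_i\,(1 - p_i)},

which is exactly the transformation applied to the leaves of the example tree earlier.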
Hence, we have calculated the output values for each leaf in the tree.
This formula is asking us to update our predictions. In the first pass, m = 1, and we substitute F0(x), the common prediction for all samples (i.e., the initial leaf value), plus nu, the learning rate, times the output value from the tree we built previously. The summation is for the case where a single sample ends up in multiple leaves.
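In the standard notation, the update being described is

F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{j,m}\, I\big(x \in R_{j,m}\big),

where nu is the learning rate and the indicator I picks out the leaf (or leaves) that the sample x falls into.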
Now we will use this new F1(x) value to get new predictions for each sample.
The new predicted value should get us a little closer to the actual value. Note that, in contrast to the single tree in our example, gradient boosting builds many trees, and M could be 100 or more.
This completes our for loop in Step 2 and we are ready for the final step of Gradient Boosting.
If we get new data, then we shall use this value to predict whether the passenger survived or not. This would give us the log(odds) that the person survived. Plugging it into the 'p' formula:
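The conversion back to a probability is the same one used throughout:

p = \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}}.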
If the resultant value lies above our threshold, then the person survived; otherwise, they did not.