To avoid underfitting and overfitting, an algorithm's bias and variance must be balanced. Boosting is one such method, aimed primarily at reducing bias.
Boosting is the process through which weak learners are combined into a strong learner. The main principle of boosting is that a sequence of weak models is fitted on weighted versions of the dataset, with the examples misclassified (or poorly predicted) by the previous learner given more weight. The individual predictions are then combined, through a weighted sum in the case of regression, to produce the final prediction. A weak learning algorithm is one whose performance is at least slightly better than random chance.
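Concretely, the combined model can be written as a weighted sum of the weak learners (this is the standard formulation; the symbols follow common convention rather than anything specific to the code below):

$$F_M(x) = \sum_{m=1}^{M} \alpha_m \, h_m(x)$$

where $h_m$ is the $m$-th weak learner and $\alpha_m$ its weight.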
The gradient boosting algorithm uses gradient descent to minimize a loss function. At each step it computes the residuals (the differences between the current predictions and the known target values), trains a weak model that maps the features to those residuals, and adds that weak model's output to the existing model's predictions, moving them closer to the targets. Repeating this step improves the overall model prediction.
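As a minimal sketch of this residual-fitting loop (assuming squared-error loss, synthetic data, and shallow decision trees as the weak learners; every name here is illustrative, not the article's code):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative data: y is a noisy function of x
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction (the mean minimizes squared error)
prediction = np.full_like(y, y.mean())
trees = []

for _ in range(n_rounds):
    # For squared-error loss, the residuals are the negative gradient
    residuals = y - prediction
    # Fit a weak learner to the residuals
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)
    trees.append(tree)
    # Add the weak learner's (shrunken) output to the running prediction
    prediction += learning_rate * tree.predict(X)
```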
To implement gradient boosting regression, the following steps are followed:

1. Read the data and extract the feature x and target y.
2. Create a GradientBoostingRegressor instance with the chosen parameters and fit it to the data.
3. Visualize the model's predictions against the data.
4. Evaluate the fit quantitatively with the model's score.
Here, we will discuss some important parameters of the sklearn.ensemble.GradientBoostingRegressor class:

- n_estimators: the number of boosting stages (weak learners) to fit.
- max_depth: the maximum depth of each individual regression tree.
- learning_rate: the factor by which each tree's contribution is shrunk before being added to the model.
- loss: the loss function to be optimized (squared error by default).
I will be using data from this site. We have a y-value corresponding to each x-value, and we will extract x and y from the CSV file.
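A minimal way to do this with pandas might look like the following (the file name and column names are assumptions, since the original only links to the data):

```python
import pandas as pd

# Hypothetical file and column names; substitute the actual CSV from the linked site
data = pd.read_csv("data.csv")
x = data["x"].values.reshape(-1, 1)  # sklearn expects a 2-D feature array
y = data["y"].values
```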
Now we are going to train the model. We import the GradientBoostingRegressor class from sklearn.ensemble, create an instance gradient_br of it with our chosen parameter values, and then call the fit method on it. Here, we set n_estimators=5, max_depth=3 and learning_rate=1.
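Putting that together, the training step might look like this sketch (the variables x and y carry over from the data-loading snippet above):

```python
from sklearn.ensemble import GradientBoostingRegressor

# Create the model with the parameter values described above
gradient_br = GradientBoostingRegressor(n_estimators=5, max_depth=3, learning_rate=1)

# Fit the model on the extracted features and targets
gradient_br.fit(x, y)
```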
Now let's visualize the model we have created. The graph shows the predicted values against x, and it looks like a good fit. We have used pyplot to plot the graph.
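The plot itself can be produced with something like the following (a sketch; the styling choices are mine, not the article's):

```python
import matplotlib.pyplot as plt

# Sort by x so the prediction curve draws cleanly from left to right
order = x.ravel().argsort()

plt.scatter(x, y, s=10, label="data")
plt.plot(x[order], gradient_br.predict(x)[order], color="red", label="predicted")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```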
Now, we will see quantitatively how well the model fits the data.
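scikit-learn's score method on a regressor returns the coefficient of determination (R²) for the given data, so one way to check is:

```python
# R^2 score of the fitted model on the training data
print(gradient_br.score(x, y))
```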
We can see that our model's score (R²) is around 99.44% on this data. The GitHub link for the above code and the CSV file is given here.