CS460 Assignment

Lecture 7

Submitted by:
Aman Upadhyay | Roll number 1711017 | aman.upadhyay@niser.ac.in |
Assigned by Dr. Subhankar Mishra | School of Computer Sciences | NISER, Bhubaneswar


Regression

Introduction

Regression is a supervised machine learning technique that uses statistical methods to build a model from labelled data [1]. It is used for the modeling and analysis of numerical data and exploits the relationship between two or more variables, so that information about one variable can be gained from the values of the others. When there is only one independent variable in the linear regression model, it is termed a simple linear regression model; when there is more than one independent variable, it is termed a multiple linear regression model [2].
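For reference, the two forms can be written as follows (standard textbook notation, adapted to the w, b symbols used below; the error term \varepsilon is an addition here, not taken from the lecture notes):

y = w x + b + \varepsilon \qquad \text{(simple linear regression)}

y = w_1 x_1 + w_2 x_2 + \dots + w_k x_k + b + \varepsilon \qquad \text{(multiple linear regression)}

where \varepsilon is a random error term.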

The Linear Model

Consider a simple linear regression model:

f_{w,b}(x) = wx + b

f_{w,b} is the model and w, b are its parameters. The model f_{w,b} is trained on input \{(x_i, y_i)\}_{i=1}^{N} such that, given an x_i, the model produces a value close to y_i. The optimized parameters are denoted by w^* and b^* [1].
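As a minimal sketch of the model itself (the function name f and the example numbers are ours, not from the lecture notes), the model is just a line parameterized by w and b:

def f(x, w, b):
    # simple linear regression model: f_{w,b}(x) = w*x + b
    return w * x + b

# example: with w = 2 and b = 1, the model predicts f(3) = 2*3 + 1 = 7
print(f(3, w=2, b=1))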

This model can be used for interpolation and extrapolation of data; both are ways of predicting values using the trained model. The range of a simple linear regression model is (x_min, x_max), where x is the independent variable of the training data.

  1. Interpolation is when the predicted value is inside this range.
  2. Extrapolation is when the predicted value is outside this range.

Interpolation is safer than extrapolation because we can be fairly sure the trend of the training data holds for x inside the range, while the same cannot be said for extrapolation.
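A quick way to make the distinction concrete (a sketch only; the helper name is_interpolation is our own) is to check whether a query point falls inside the training range before trusting the prediction:

def is_interpolation(x_query, x_train):
    # True if x_query lies inside the range covered by the training data,
    # i.e. a prediction at x_query is interpolation rather than extrapolation
    return min(x_train) <= x_query <= max(x_train)

x_train = [1, 2, 3, 4, 5]
print(is_interpolation(3.5, x_train))  # True  -> interpolation
print(is_interpolation(7.0, x_train))  # False -> extrapolation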
Fig 1: An example of linear regression

Loss Function

All algorithms in machine learning rely on minimizing or maximizing a function, which we call the "objective function". The functions that are minimized are called "loss functions". A loss function is a measure of how well a prediction model is able to predict the expected outcome. A commonly used loss function is the "mean squared error" [3]:

L(w, b) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - f_{w,b}(x_i) \right)^2 = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (w x_i + b) \right)^2

The loss function sums (and averages) the penalty for the prediction error on each example. Squaring is useful because we do not need the sign of the error, only its magnitude, hence an even power is used. The absolute value is not used because it is not differentiable at every x, and higher even powers magnify the effect of outliers too much.
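A minimal sketch (the residual values below are made up for illustration) of how the choice of power treats a single large error:

errors = [0.5, -0.5, 1.0, -4.0]  # residuals y_i - f(x_i); the last one is an outlier

mae = sum(abs(e) for e in errors) / len(errors)       # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)       # mean squared error
quartic = sum(e ** 4 for e in errors) / len(errors)   # higher even power

# the outlier's contribution grows rapidly with the power of the penalty
print(mae, mse, quartic)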

Fig 2: Calculation of loss

Closed-Form Solution

The optimal parameters w^* and b^* for the model are found by minimizing the loss function, and a closed-form solution is available for simple linear regression:

w^* = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad b^* = \bar{y} - w^* \bar{x}

where \bar{x} and \bar{y} are the means of the x_i and y_i respectively.
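A minimal sketch of this closed form on the small dataset used in the implementation section below, cross-checked against NumPy's polyfit (the variable names are ours):

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 2, 5, 4, 3], dtype=float)

# closed-form least-squares estimates for simple linear regression
w_star = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b_star = y.mean() - w_star * x.mean()

# np.polyfit(x, y, 1) fits a degree-1 polynomial and returns [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
print(w_star, b_star)  # should agree with slope, intercept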

Why Linear Regression?

There are several advantages of simple linear regression over non-linear regression:

  1. Simplicity: the model is simple and can be easily interpreted from a plot.
  2. The solution can be found easily, in closed form.
  3. Little risk of overfitting, since the model has only two parameters.

Implementation & Graphical Interpretation

We use a small dataset, tabulated below:

x: 1 2 3 4 5
y: 3 2 5 4 3

We first define some functions for calculating the mean, slope, intercept, and loss.
import numpy as np
import matplotlib.pyplot as plt  # To visualize


def mean(z):       # mean of a list of numbers
    return sum(z) / len(z)


def w_func(x_, y_): # to find the slope
    xy = [i * j for i, j in zip(x_, y_)]
    y_mean_x = [mean(y_) * i for i in x_]
    x2 = np.square(x_)
    x_mean_x = [mean(x_) * i for i in x_]
    numerator = [i - j for i, j in zip(xy, y_mean_x)]
    denominator = [i - j for i, j in zip(x2, x_mean_x)]
    return sum(numerator) / sum(denominator)


def b_func(x_, y_, w_): #to find the intercept
    x_mean = sum(x_) / len(x_)
    y_mean = sum(y_) / len(y_)
    return y_mean - (w_ * x_mean)


def y_pred_func(b_, w_, x_):  #to find the predicted value
    return [b_ + (w_ * i) for i in x_]


def mse_func(y_, y_pred_):    #to find the mean square error
    error = [(i - j) ** 2 for i, j in zip(y_, y_pred_)]
    return mean(error)

Then we call these functions to obtain the linear regression model.

if __name__ == '__main__':
    x = [1, 2, 3, 4, 5] #dataset
    y = [3, 2, 5, 4, 3]
    print("X ::", x)
    print("Y ::", y)
    w = w_func(x, y)  #calculate slope
    print("w ::", w)
    b = b_func(x, y, w)   #calculate intercept
    print("b ::", b)
    y_pred = y_pred_func(b, w, x)  # sanity check: calculate the predicted values
    print("Y Predicted :", y_pred)
    mse = mse_func(y, y_pred)   # calculate the mse
    print("Mean square error :", mse)

    plt.scatter(x,y)
    plt.plot(x, y_pred, color='red')
    plt.show()
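On this dataset the script should print a slope of approximately w ≈ 0.2, an intercept of b ≈ 2.8, predicted values of roughly 3.0, 3.2, 3.4, 3.6, 3.8, and a mean squared error of about 0.96 (up to floating-point rounding), along with a scatter plot of the data overlaid with the fitted line in red.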

References

[1] Mishra, S. "Linear Regression". Lecture 7. CS460 Class Lecture (Lecture Notes, NISER, Bhubaneswar, India)

[2] Shalabh. Simple Linear Regression Analysis (Lecture Notes, IIT K, Kanpur, India). Retrieved from http://home.iitk.ac.in/~shalab/regression/Chapter2-Regression-SimpleLinearRegressionAnalysis.pdf

[3] Probability Course. Mean Squared Error. Retrieved from https://www.probabilitycourse.com/chapter9/9_1_5_mean_squared_error_MSE.php