Regression Using SVM

Susobhan Bandopadhyay


1. Overview

Support Vector Machines (SVMs) are well known for classification problems and were introduced by Vladimir Vapnik and his colleagues in 1992. Their use for regression, however, is not as well documented. SVM regression is considered a nonparametric technique because it relies on kernel functions. Support Vector Regression (SVR) uses the same principles as SVM classification, with only a few minor differences. First, because the output is a real number, it has infinitely many possible values, so predicting it exactly is not the aim; instead, a margin of tolerance ε is specified, and only deviations from the target larger than ε are treated as errors. The optimization problem is also somewhat more involved, as described below. The main goal is the following: to minimize error by finding the hyperplane that maximizes the margin, keeping in mind that part of the error is tolerated. In other words, the goal is to find a function f(x) that deviates from $y_i$ by a value no greater than ε for each training point $x_i$, and at the same time is as flat as possible.
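
To make the ε-tolerance concrete, here is a small numeric illustration (not from the original article): residuals whose magnitude is at most ε contribute nothing to the loss, while larger residuals are penalized only for the part that exceeds ε.

import numpy as np

eps = 0.5
residuals = np.array([-1.2, -0.3, 0.0, 0.4, 0.9])
# epsilon-insensitive loss: max(0, |r| - eps)
loss = np.maximum(0.0, np.abs(residuals) - eps)
print(loss)   # [0.7 0.  0.  0.  0.4]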


2. Linear SVR: Primal Formula

Suppose we have a set of training data where $x_i$ is a multivariate set of $N$ observations with corresponding label values $y_i$. To find the linear function $$f(x) = xw+b$$ and ensure that it is as flat as possible, find $f(x)$ with the minimal norm value $w^{′}w$. This is formulated as a convex optimization problem: minimize $$J(w)= \frac{1}{2} w^{′}w$$ subject to all residuals having a value less than ε; or, in equation form: $$∀i: |y_i−(x_iw+b)| ≤ ε.$$
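
As an illustration, this hard-constraint primal can be written down directly with a generic convex solver. The sketch below uses cvxpy and synthetic noiseless data (both are assumptions of this example, not part of the article); with exactly linear targets the ε-constraints are feasible and the solver returns a flat (minimum-norm) $w$.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 2))        # N = 30 observations, 2 features
y = X @ np.array([1.5, -0.7]) + 0.3             # exact linear targets (no noise)
eps = 0.1                                       # epsilon tolerance

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))        # J(w) = (1/2) w'w
constraints = [cp.abs(y - (X @ w + b)) <= eps]          # |y_i - (x_i w + b)| <= eps, for all i
cp.Problem(objective, constraints).solve()

print(w.value, b.value)   # close to (but slightly flatter than) [1.5, -0.7] and 0.3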


2.1 Introducing Slack Variables: Another Hyperparameter

It might be the case that no function $f(x)$ satisfies these constraints for all points. To cope with otherwise infeasible constraints, introduce slack variables $ξ_i$ and $ξ^{*}_i$ for each point. This approach is similar to the “soft margin” concept in SVM classification, because the slack variables allow regression errors up to $ξ_i$ and $ξ^{*}_i$ to exist while still satisfying the required conditions. Including the slack variables leads to the objective function, also known as the primal formula: $$J(w)=\frac{1}{2} w^{′}w+C\sum_{i=1}^{n}(ξ_i+ξ^{*}_i),$$ subject to: $$∀i: y_i−(x_i w+b)≤ε+ξ_i$$ $$∀i:(x_i w+b)−y_i≤ε+ ξ^{*}_i$$ $$∀i: ξ^{*}_i ≥0$$ $$∀i: ξ_i≥0.$$ We now have an additional hyperparameter, C, that we can tune: it sets the penalty on points falling outside the ε-tube. As C increases, such violations are penalized more heavily and fewer of them are tolerated; in the limit of very large C the problem approaches the simplified hard-constraint formulation above (which is sometimes infeasible), while as C approaches 0 the slack terms become essentially free and minimizing the norm of w dominates.
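
The soft-margin primal can likewise be spelled out with a generic convex solver. The following sketch mirrors the earlier one (again assuming cvxpy and synthetic data, this time with noisy targets) and adds the slack variables and the C-weighted penalty term.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n = 30
X = rng.uniform(-1.0, 1.0, size=(n, 2))
y = X @ np.array([1.5, -0.7]) + 0.3 + 0.2 * rng.standard_normal(n)   # noisy targets
eps, C = 0.1, 1.0

w = cp.Variable(2)
b = cp.Variable()
xi = cp.Variable(n, nonneg=True)        # xi_i  (violations above the tube)
xi_star = cp.Variable(n, nonneg=True)   # xi*_i (violations below the tube)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi + xi_star))
constraints = [
    y - (X @ w + b) <= eps + xi,
    (X @ w + b) - y <= eps + xi_star,
]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
print("points outside the eps-tube:", int(np.sum(xi.value + xi_star.value > 1e-6)))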


3. Nonlinear SVR: Primal Formula

Some regression problems cannot be described adequately with a linear model, so we extend the technique described above to nonlinear functions. To obtain a nonlinear SVR model, we replace the dot product $x_1^{'}x_2$ with a nonlinear kernel function $G(x_1,x_2) = \langle φ(x_1),φ(x_2)\rangle$, where $φ(x)$ is a transformation that maps $x$ to a high-dimensional space. Some common kernel functions are listed below; the Gaussian entry is checked numerically in a short sketch after the figure.

Kernel       Kernel Function
Linear       $G(x_j,x_k)=x_j^{′}x_k$
Gaussian     $G(x_j,x_k)=\exp(−\|x_j−x_k\|^2)$
Polynomial   $G(x_j,x_k)=(1+x_j^{′}x_k)^q, \text{ where } q \in \{2,3\}$

Left: Before Transformation; Right: After Transformation into Higher Dimensional Space
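
As a quick sanity check of the Gaussian entry in the table above (this snippet is an illustration, not part of the original article), the Gram matrix can be computed by hand and compared with scikit-learn's rbf_kernel, which evaluates exp(−γ‖x_j−x_k‖²) and reduces to the table's formula when γ = 1.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[0.0, 1.0], [1.0, 2.0], [3.0, 0.5]])
# exp(-||x_j - x_k||^2), computed for every pair of rows
manual = np.exp(-np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1))
print(np.allclose(manual, rbf_kernel(x, gamma=1.0)))   # True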

4. Implementation (Python Code)

# Reading Data
import pandas as pd

data = pd.read_csv('sample.csv')
#data = pd.read_csv('weight.csv')
print(data.shape)

data.columns = ['X1','X2', 'Y']
X = data['X1']
Y = data['X2']
'''
data.columns = ['index','one', 'weight','age', 'fat']
X = data['weight']
Y = data['age']
'''
data.head()

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.set_style("ticks")

# Plot the data
sns.scatterplot(x=X, y=Y)

# Import the model
#import numpy as np
from sklearn.svm import SVR
#import matplotlib.pyplot as plt

# Build the Model 
svr_rbf = SVR(kernel='rbf', C=0.01, gamma=0.1, epsilon=.4)
svr_lin = SVR(kernel='linear', C=.001, gamma=0.3, epsilon=.5)
#svr_poly = SVR(kernel='poly', C=100, gamma='auto', degree=3, epsilon=.1, coef0=1)



# Convert the pandas Series to a 2-D array of shape (n_samples, 1); scikit-learn expects 2-D feature input
#X_matrix = X.to_numpy().reshape(len(X),1)
X = X.to_numpy().reshape(len(X),1)

# Fit the training data
#model = lreg.fit(X_matrix,Y)
#model = svr_lin.fit(X_matrix,Y)
#model = svr_rbf.fit(X_matrix,Y)
#model = svr_poly.fit(X_matrix,Y)

#print("Coefficient - ",model.coef_[0])
#print("Intercept - ", model.intercept_)

#Y_fit = model.predict(X_matrix)
# print(Y_fit)

lw=2

svrs = [svr_rbf, svr_lin]
kernel_label = ['RBF', 'Linear']
model_color = ['c', 'g']

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 10), sharey=True)
for ix, svr in enumerate(svrs):
    # Fit each model and plot its predictions
    axes[ix].plot(X, svr.fit(X, Y).predict(X), color=model_color[ix], lw=lw,
                  label='{} model'.format(kernel_label[ix]))
    # Points indexed by svr.support_ are the support vectors
    axes[ix].scatter(X[svr.support_], Y[svr.support_], facecolor="none", edgecolor="k", s=50,
                     label='{} support vectors'.format(kernel_label[ix]))
    # The remaining training points are not support vectors
    axes[ix].scatter(X[np.setdiff1d(np.arange(len(X)), svr.support_)],
                     Y[np.setdiff1d(np.arange(len(X)), svr.support_)],
                     facecolor=model_color[ix], edgecolor=model_color[ix], s=50,
                     label='other training data')
    axes[ix].legend(loc='upper center', bbox_to_anchor=(0.5, 1.1), ncol=1,
                    fancybox=True, shadow=True)

fig.text(0.5, 0.04, 'data', ha='center', va='center')
fig.text(0.06, 0.5, 'target', ha='center', va='center', rotation='vertical')
fig.suptitle("Support Vector Regression", fontsize=14)

'''
sns.scatterplot(X,Y)
#sns.lineplot(model.coef_[0],model.intercept_)
sns.lineplot(X,Y_fit,color=".2")
'''

Left: RBF kernel (C=0.01, gamma=0.1, epsilon=0.4); Right: Linear kernel (C=0.001, gamma=0.3, epsilon=0.5)
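
As a follow-up usage sketch (not part of the original notebook), a fitted model can be used for prediction and a simple error check. The snippet below reuses the svr_rbf model and the X, Y arrays prepared above.

from sklearn.metrics import mean_absolute_error

fitted = svr_rbf.fit(X, Y)   # already fitted inside the plotting loop; refit here for clarity
Y_hat = fitted.predict(X)
print("MAE:", mean_absolute_error(Y, Y_hat))
# Fraction of training points lying inside the epsilon tube
print("Inside the tube:", np.mean(np.abs(Y - Y_hat) <= svr_rbf.epsilon))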

5. Conclusion

SVR is a powerful algorithm that lets us choose how tolerant we are of errors, both through the acceptable error margin ($ϵ$) and through the hyperparameter C, which controls how heavily points falling outside that margin are penalized.


References

  1. Vapnik, V. The Nature of Statistical Learning Theory. Springer, New York, 1995.
  2. Huang, T.M., V. Kecman, and I. Kopriva. Kernel Based Algorithms for Mining Huge Data Sets: Supervised, Semi-Supervised, and Unsupervised Learning. Springer, New York, 2006.
  3. Platt, J. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Technical Report MSR-TR-98-14, 1999.
  4. https://www.saedsayad.com/support_vector_machine_reg.html