A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for classification purposes. The theoretical basis of the SVM is the idea of finding a hyperplane that divides a given dataset into two classes in the best possible manner.
In many classification problems, however, no linear decision boundary (hyperplane) separates the classes. In such cases the SVM still performs the classification by producing a nonlinear boundary, which is constructed as a linear boundary in a transformed, higher-dimensional version of the feature space.
A support vector machine constructs a hyperplane, or a set of hyperplanes, in a high-dimensional space. A "good" separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any
class; that is, we try to find the decision boundary that maximizes the margin. We do this in order to keep the generalization error as small as possible, since the larger the margin, the lower the generalization error of the classifier.
The problem of a non-linear decision boundary is solved by using the kernel trick.
In the ideal case the data is linearly separable, so we can find a separating hyperplane that divides the data into two classes. In most practical situations, however, the data is not linearly separable. In such cases we use the kernel trick, which maps (transforms) the input data non-linearly into a higher-dimensional space; in this transformed space the data can be separated linearly. In layman's terms, the kernel trick allows the SVM to form non-linear boundaries, as the sketch below illustrates.
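To make this concrete, here is a minimal sketch (assuming NumPy and scikit-learn are available; the dataset and feature choice are ours, purely for illustration). Points on two concentric circles cannot be split by a line in the original 2-D space, but after appending the squared radius as a third coordinate a linear classifier separates them easily.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear classifier in the original coordinates performs poorly.
acc_original = LinearSVC(max_iter=10000).fit(X, y).score(X, y)

# Explicit non-linear lift: append the squared radius x1^2 + x2^2 as a new feature.
X_lifted = np.hstack([X, (X ** 2).sum(axis=1, keepdims=True)])
acc_lifted = LinearSVC(max_iter=10000).fit(X_lifted, y).score(X_lifted, y)

print(f"training accuracy in the original space: {acc_original:.2f}")
print(f"training accuracy after the lift:        {acc_lifted:.2f}")
```

This sketch uses an explicit transformation for clarity; the kernel trick described next achieves the same effect without ever computing the lifted coordinates.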
In order to train a support vector classifier and optimize the objective function in this transformed space, we would have to perform operations on the higher-dimensional vectors in the higher-dimensional feature space. In practical situations the data may have many features, and applying transformations that involve many polynomial combinations of these features leads to extremely high computational costs. So, instead of applying the transformation \(\phi(x)\) and representing the data by its transformed coordinates in the higher-dimensional feature space, we use the kernel trick to represent the data only through a set of pairwise similarity comparisons between the original observations \(x\), computed in the original input space.
The mathematical setting can be described as follows. In the original space a hyperplane of the form \(f(x) = w^Tx + b \) gives a linear classifier, which may fail to separate the data. After applying the transformation \(\phi(x)\) we instead work with a classifier of the form \(f(x) = w_{0}^T\phi(x) + b_{0} \), which is linear in the feature space but corresponds to a non-linear boundary in the original space. A kernel function is defined to be a function that takes as input vectors in the original space and returns the dot product of their images in the higher-dimensional feature space. Mathematically, if \(X\) and \(Y\) are two data vectors in the dataset space \(D\) and \( \phi : D \rightarrow R^n \) is the transformation (where \(R^n\) is the real \(n\)-dimensional feature space), then the kernel function is given by \( k(X,Y) = \phi(X)\cdot\phi(Y)\), where \('\cdot'\) denotes the dot product in the feature space (in practice we take the transpose of one vector so that the multiplication of the vectors makes sense). The 'trick' in this method is that it enables us to perform all the required operations in the input space rather than in the higher-dimensional feature space; we never need to evaluate \(\phi\) or the dot product in the feature space explicitly.
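As a small illustration of this "pairwise similarity" view (a sketch assuming NumPy; the function and variable names are ours, not from any particular library), the data enter the training problem only through the Gram matrix \(K_{ij} = k(x_i, x_j)\), which is computed entirely in the original input space:

```python
import numpy as np

def gram_matrix(X, kernel):
    """Return the n x n matrix of pairwise kernel evaluations K[i, j] = k(x_i, x_j)."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

# Example: the square kernel (X^T Y)^2 discussed later in the text.
square_kernel = lambda u, v: float(np.dot(u, v)) ** 2

X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
print(gram_matrix(X, square_kernel))
```

Note that \(\phi\) never appears in the computation; only kernel evaluations between pairs of original observations are needed.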
The problem that we need to solve is a constrained optimization problem. Such problems are solved using Lagrange multipliers, which help us eliminate unwanted variables and thus find the solution to the optimization problem. The problem posed in the case of the SVM is,
$$ \min_{w,b} \frac{1}{2}||w||^2 $$ such that $$ y_i(w^Tx_i + b) \geq 1$$
Applying the Lagrangian method to this optimization problem, with the constraint rewritten as $$h_i(w,b) = -y_i(w^Tx_i + b) + 1 \leq 0, $$ we get the Lagrangian objective as,
$$L(w,b,\alpha) = \frac{1}{2}||w||^2 - \sum_{i=1}^{n}\alpha_i (y_i(w^Tx_i + b) - 1)$$
$$ = \frac{1}{2}w^Tw - \sum_{i=1}^{n}\alpha_i y_iw^Tx_i - \sum_{i=1}^{n}\alpha_i y_i b + \sum_{i=1}^{n}\alpha_i $$
Setting the partial derivatives of \(L\) with respect to \(w\) and \(b\) to zero gives the conditions \(w = \sum_{i=1}^{n}\alpha_i y_i x_i\) and \(\sum_{i=1}^{n}\alpha_i y_i = 0\). Substituting this expression for \(w\) back in,
$$ = \frac{1}{2} \sum_{i,j =1}^{n}\alpha_i \alpha_j y_i y_j x_i^T x_j - \sum_{i,j =1}^{n}\alpha_i \alpha_j y_i y_j x_i^T x_j - \sum_{i=1}^{n}\alpha_i y_i b + \sum_{i=1}^{n}\alpha_i $$
$$ = \sum_{i=1}^{n}\alpha_i - \frac{1}{2} \sum_{i,j =1}^{n}\alpha_i \alpha_j y_i y_j x_i^T x_j $$
Hence in this process we have eliminated \(w\) and \(b\), which were the unwanted variables for us. Now the solution of minimizing \(L(w,b,\alpha)\) with respect to \( w,b\), subject to \( \alpha \geq 0 \), is the same as the solution of maximizing \(L(w,b,\alpha)\) with respect to \( \alpha \), subject to appropriate constraints. This kind of min-max optimization problem is called the dual problem in our context. Since we have eliminated \(w\) and \(b\), we can write \(L(w,b,\alpha)\) simply as \(L(\alpha)\). Finally, the problem that needs to be solved in the dual form for the linear classifier is,
$$ \max_{\alpha} L(\alpha) $$ subject to $$ \alpha_i \geq 0 $$ for all \(i = 1,2,...,n\) and $$ \sum_{i=1}^{n}\alpha_iy_i = 0 $$
where,
$$ L(\alpha) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2} \sum_{i,j =1}^{n}\alpha_i \alpha_j y_i y_j x_i^T x_j $$
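To see the dual problem in action, here is a minimal sketch that hands \(-L(\alpha)\) to a general-purpose constrained optimizer (assuming NumPy and SciPy; production SVM solvers use specialised quadratic-programming or SMO routines instead, and the toy data below is made up purely for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# A tiny linearly separable toy set; labels must be +1 / -1.
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [-0.5, 0.5], [0.5, -0.5]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

# G[i, j] = y_i y_j x_i^T x_j, the matrix appearing in L(alpha).
G = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(alpha):
    # Maximizing L(alpha) is the same as minimizing -L(alpha).
    return 0.5 * alpha @ G @ alpha - alpha.sum()

constraints = {"type": "eq", "fun": lambda a: a @ y}   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * len(y)                        # alpha_i >= 0
result = minimize(neg_dual, np.zeros(len(y)), bounds=bounds, constraints=constraints)

alpha = result.x
w = (alpha * y) @ X                      # w = sum_i alpha_i y_i x_i
support = alpha > 1e-5                   # support vectors have alpha_i > 0
b = np.mean(y[support] - X[support] @ w) # b recovered from the support vectors
print("alpha =", np.round(alpha, 3))
print("w =", w, " b =", round(b, 3))
```

The nonzero \(\alpha_i\) picked out at the end correspond exactly to the support vectors discussed at the end of this section.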
There are mainly three types of kernel functions that we encounter or choose in the SVM context: the \(d^{th}\)-degree polynomial kernel, the radial basis function kernel, and the neural network (sigmoid) kernel. As an example, consider the square kernel \(k(X,Y) = (X^TY)^2\); a quick check of its feature-space interpretation is given below.
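Here is that check (a sketch using NumPy; the explicit map \(\phi(x) = (x_1^2,\ \sqrt{2}\,x_1x_2,\ x_2^2)\) is the standard two-dimensional expansion of the square kernel):

```python
import numpy as np

def phi(v):
    # Explicit feature map for the square kernel in two dimensions.
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 3.0])
y = np.array([2.0, -1.0])

kernel_value = float(np.dot(x, y)) ** 2        # computed in the original space
feature_dot = float(np.dot(phi(x), phi(y)))    # dot product in the feature space
print(kernel_value, feature_dot)               # both equal 1.0 for these vectors
```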
The radial basis function is a popular kernel, also known as the Gaussian kernel. We define this kernel as $$ k(X,Y) = \exp\Big(-\frac{||X-Y||^2}{2\sigma^2}\Big) $$ or $$ k(X,Y) = \exp\Big(-\gamma ||X-Y||^2\Big) $$ where $$ \gamma = \frac{1}{2\sigma^2} $$ and \(||\cdot||\) denotes the Euclidean norm.
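Written as code, the kernel is essentially a one-liner (a sketch using NumPy, with the \(\gamma\) parametrisation):

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """k(x, y) = exp(-gamma * ||x - y||^2), where gamma = 1 / (2 * sigma^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])
print(rbf_kernel(x, y))   # ||x - y||^2 = 5, so this prints exp(-2.5) ≈ 0.082
```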
There are two hyperparameters associated with the Gaussian kernel. The first is \(\gamma\) (equivalently \(\sigma\)), which controls the spread (width) of the kernel and is also known as the shape parameter. For small \(\sigma\) (large \(\gamma\)) the kernel is narrow with a large peak, so each training point has only a very local influence; for large \(\sigma\) (small \(\gamma\)) the kernel is wide with a small peak, so its influence spreads much further.
The second hyperparameter is \(C\), also known as the regularization parameter. The \(C\) parameter trades off correct classification of training examples against maximization of the decision function's margin. For larger values of \(C\), a smaller margin will be accepted if the decision function is better at classifying all training points correctly. A lower \(C\) encourages a larger margin, and therefore a simpler decision function, at the cost of training accuracy. The sketch below illustrates the effect of both parameters.
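The following sketch (assuming scikit-learn; the dataset and the grids of values are chosen only for illustration) shows the combined effect: larger \(\gamma\) and larger \(C\) track the training data more closely, while smaller values give a smoother, larger-margin decision function.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Noisy, non-linear toy data for an RBF-kernel SVM.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.15, random_state=0)

for gamma in (0.1, 1.0, 10.0):
    for C in (0.1, 1.0, 100.0):
        clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)
        print(f"gamma={gamma:>5}, C={C:>5}: training accuracy = {clf.score(X, y):.2f}")
```

High training accuracy for large \(\gamma\) and \(C\) does not by itself imply good generalization; in practice these values are chosen by cross-validation.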
The support vectors are the data points that lie closest to the decision boundaries or hyperplanes. They are the data points that are most difficult to classify. As the name 'support' suggests, the decision boundary or hyperplane is supported on these vectors, and moving or removing them would change the decision boundary. The entire SVM model is thus based on the concept of support vectors.
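As a final sketch (assuming scikit-learn; the dataset is made up for illustration), a fitted model exposes exactly these points, and they are typically only a small fraction of the training set:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("number of training points :", len(X))
print("number of support vectors :", clf.support_vectors_.shape[0])
print("their indices in the data :", clf.support_)
```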