A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for classification purposes. The theoretical basis of the SVM is the idea of finding a hyperplane that divides a given dataset into two classes in the best possible manner.
In many classification problems, however, no linear decision boundary (hyperplane) separates the classes. In such cases the SVM still performs the classification by producing a nonlinear boundary, which is constructed as a linear boundary in a transformed, higher-dimensional version of the feature space.
A support vector machine constructs a hyperplane, or a set of hyperplanes, in a high-dimensional space. A "good" separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any
class; that is, we try to find the decision boundary that maximizes the margin. We do this in order to keep the generalization error as small as possible, since the larger the margin, the lower the generalization error of the classifier.
The problem of a non-linear decision boundary is solved by using the kernel trick.
In the ideal case the data is linearly separable, so we can find a separating hyperplane that divides the data into two classes. In most practical situations, however, the data is not linearly separable. In such cases we use the kernel trick, which maps (transforms) the input data non-linearly into a higher-dimensional space; in this transformed space the data can be separated linearly. In layman's terms, the kernel trick allows the SVM to form non-linear boundaries, as the sketch below illustrates.
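To make this concrete, here is a minimal sketch (assuming NumPy and scikit-learn are available; the dataset and feature choice are ours, purely for illustration). Points on two concentric circles cannot be split by a line in the original 2-D space, but after appending the squared radius as a third coordinate a linear classifier separates them easily.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear classifier in the original coordinates performs poorly.
acc_original = LinearSVC(max_iter=10000).fit(X, y).score(X, y)

# Explicit non-linear lift: append the squared radius x1^2 + x2^2 as a new feature.
X_lifted = np.hstack([X, (X ** 2).sum(axis=1, keepdims=True)])
acc_lifted = LinearSVC(max_iter=10000).fit(X_lifted, y).score(X_lifted, y)

print(f"training accuracy in the original space: {acc_original:.2f}")
print(f"training accuracy after the lift:        {acc_lifted:.2f}")
```

This sketch uses an explicit transformation for clarity; the kernel trick described next achieves the same effect without ever computing the lifted coordinates.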
In order to train a support vector classifier and optimize the objective function in this transformed space, we would have to perform operations on the higher-dimensional vectors in the higher-dimensional feature space. In practical situations the data may have many features, and applying transformations that involve many polynomial combinations of these features leads to extremely high computational costs. So, instead of applying the transformation \(\phi(x)\) and representing the data by its transformed coordinates in the higher-dimensional feature space, we use the kernel trick to represent the data only through a set of pairwise similarity comparisons between the original observations \(x\), computed in the original input space.
The mathematical setting can be described as follows. In the original space a hyperplane of the form \(f(x) = w^Tx + b \) gives a linear classifier, which may fail to separate the data. After applying the transformation \(\phi(x)\) we instead work with a classifier of the form \(f(x) = w_{0}^T\phi(x) + b_{0} \), which is linear in the feature space but corresponds to a non-linear boundary in the original space. A kernel function is defined to be a function that takes as input vectors in the original space and returns the dot product of their images in the higher-dimensional feature space. Mathematically, if \(X\) and \(Y\) are two data vectors in the dataset space \(D\) and \( \phi : D \rightarrow R^n \) is the transformation (where \(R^n\) is the real \(n\)-dimensional feature space), then the kernel function is given by \( k(X,Y) = \phi(X)\cdot\phi(Y)\), where \('\cdot'\) denotes the dot product in the feature space (in practice we take the transpose of one vector so that the multiplication of the vectors makes sense). The 'trick' in this method is that it enables us to perform all the required operations in the input space rather than in the higher-dimensional feature space; we never need to evaluate \(\phi\) or the dot product in the feature space explicitly.
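As a small illustration of this "pairwise similarity" view (a sketch assuming NumPy; the function and variable names are ours, not from any particular library), the data enter the training problem only through the Gram matrix \(K_{ij} = k(x_i, x_j)\), which is computed entirely in the original input space:

```python
import numpy as np

def gram_matrix(X, kernel):
    """Return the n x n matrix of pairwise kernel evaluations K[i, j] = k(x_i, x_j)."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

# Example: the square kernel (X^T Y)^2 discussed later in the text.
square_kernel = lambda u, v: float(np.dot(u, v)) ** 2

X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
print(gram_matrix(X, square_kernel))
```

Note that \(\phi\) never appears in the computation; only kernel evaluations between pairs of original observations are needed.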
The problem that we need to solve is a constrained optimization problem. Such problems are solved using Lagrange multipliers, which help us eliminate unwanted variables and thus find the solution to the optimization problem. The problem posed in the case of the SVM is,
$$ \min_{w,b} \frac{1}{2}||w||^2 $$ such that $$ y_i(w^Tx_i + b) \geq 1$$
Applying the Lagrangian method to this optimization problem, with the constraint rewritten as $$h_i(w,b) = -y_i(w^Tx_i + b) + 1 \leq 0, $$ we get the Lagrangian objective as,
$$L(w,b,\alpha) = \frac{1}{2}||w||^2 - \sum_{i=1}^{n}\alpha_i (y_i(w^Tx_i + b) - 1)$$
$$ = \frac{1}{2}w^Tw - \sum_{i=1}^{n}\alpha_i y_iw^Tx_i - \sum_{i=1}^{n}\alpha_i y_i b + \sum_{i=1}^{n}\alpha_i $$
Setting the partial derivatives of \(L\) with respect to \(w\) and \(b\) to zero gives the conditions \(w = \sum_{i=1}^{n}\alpha_i y_i x_i\) and \(\sum_{i=1}^{n}\alpha_i y_i = 0\). Substituting this expression for \(w\) back in,
$$ = \frac{1}{2} \sum_{i,j =1}^{n}\alpha_i \alpha_j y_i y_j x_i^T x_j - \sum_{i,j =1}^{n}\alpha_i \alpha_j y_i y_j x_i^T x_j - \sum_{i=1}^{n}\alpha_i y_i b + \sum_{i=1}^{n}\alpha_i $$
$$ = \sum_{i=1}^{n}\alpha_i - \frac{1}{2} \sum_{i,j =1}^{n}\alpha_i \alpha_j y_i y_j x_i^T x_j $$
Hence in this process we have eliminated \(w\) and \(b\), which were the unwanted variables for us. Now the solution of minimizing \(L(w,b,\alpha)\) with respect to \( w,b\), subject to \( \alpha \geq 0 \), is the same as the solution of maximizing \(L(w,b,\alpha)\) with respect to \( \alpha \), subject to appropriate constraints. This kind of min-max optimization problem is called the dual problem in our context. Since we have eliminated \(w\) and \(b\), we can write \(L(w,b,\alpha)\) simply as \(L(\alpha)\). Finally, the problem that needs to be solved in the dual form for the linear classifier is,
$$ \max_{\alpha} L(\alpha) $$ subject to $$ \alpha_i \geq 0 $$ for all \(i = 1,2,...,n\) and $$ \sum_{i=1}^{n}\alpha_iy_i = 0 $$
where,
$$ L(\alpha) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2} \sum_{i,j =1}^{n}\alpha_i \alpha_j y_i y_j x_i^T x_j $$
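To see the dual problem in action, here is a minimal sketch that hands \(-L(\alpha)\) to a general-purpose constrained optimizer (assuming NumPy and SciPy; production SVM solvers use specialised quadratic-programming or SMO routines instead, and the toy data below is made up purely for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# A tiny linearly separable toy set; labels must be +1 / -1.
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [-0.5, 0.5], [0.5, -0.5]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

# G[i, j] = y_i y_j x_i^T x_j, the matrix appearing in L(alpha).
G = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(alpha):
    # Maximizing L(alpha) is the same as minimizing -L(alpha).
    return 0.5 * alpha @ G @ alpha - alpha.sum()

constraints = {"type": "eq", "fun": lambda a: a @ y}   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * len(y)                        # alpha_i >= 0
result = minimize(neg_dual, np.zeros(len(y)), bounds=bounds, constraints=constraints)

alpha = result.x
w = (alpha * y) @ X                      # w = sum_i alpha_i y_i x_i
support = alpha > 1e-5                   # support vectors have alpha_i > 0
b = np.mean(y[support] - X[support] @ w) # b recovered from the support vectors
print("alpha =", np.round(alpha, 3))
print("w =", w, " b =", round(b, 3))
```

The nonzero \(\alpha_i\) picked out at the end correspond exactly to the support vectors discussed at the end of this section.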
There are mainly three types of kernel functions that we encounter or choose in the SVM context: the \(d^{th}\)-degree polynomial kernel, the radial basis function kernel, and the neural network (sigmoid) kernel. As an example, consider the square kernel \(k(X,Y) = (X^TY)^2\); a quick check of its feature-space interpretation is given below.
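Here is that check (a sketch using NumPy; the explicit map \(\phi(x) = (x_1^2,\ \sqrt{2}\,x_1x_2,\ x_2^2)\) is the standard two-dimensional expansion of the square kernel):

```python
import numpy as np

def phi(v):
    # Explicit feature map for the square kernel in two dimensions.
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 3.0])
y = np.array([2.0, -1.0])

kernel_value = float(np.dot(x, y)) ** 2        # computed in the original space
feature_dot = float(np.dot(phi(x), phi(y)))    # dot product in the feature space
print(kernel_value, feature_dot)               # both equal 1.0 for these vectors
```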
The radial basis function is a popular kernel, also known as the Gaussian kernel. We define this kernel as $$ k(X,Y) = \exp\Big(-\frac{||X-Y||^2}{2\sigma^2}\Big) $$ or $$ k(X,Y) = \exp\Big(-\gamma ||X-Y||^2\Big) $$ where $$ \gamma = \frac{1}{2\sigma^2} $$ and \(||\cdot||\) denotes the Euclidean norm.
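Written as code, the kernel is essentially a one-liner (a sketch using NumPy, with the \(\gamma\) parametrisation):

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """k(x, y) = exp(-gamma * ||x - y||^2), where gamma = 1 / (2 * sigma^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])
print(rbf_kernel(x, y))   # ||x - y||^2 = 5, so this prints exp(-2.5) ≈ 0.082
```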
There are two hyperparameters associated with the Gaussian kernel. The first is \(\gamma\) (equivalently \(\sigma\)), which controls the spread (width) of the kernel and is also known as the shape parameter. For small \(\sigma\) (large \(\gamma\)) the kernel is narrow with a large peak, so each training point has only a very local influence; for large \(\sigma\) (small \(\gamma\)) the kernel is wide with a small peak, so its influence spreads much further.
The second hyperparameter is \(C\), also known as the regularization parameter. The \(C\) parameter trades off correct classification of training examples against maximization of the decision function's margin. For larger values of \(C\), a smaller margin will be accepted if the decision function is better at classifying all training points correctly. A lower \(C\) encourages a larger margin, and therefore a simpler decision function, at the cost of training accuracy. The sketch below illustrates the effect of both parameters.
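The following sketch (assuming scikit-learn; the dataset and the grids of values are chosen only for illustration) shows the combined effect: larger \(\gamma\) and larger \(C\) track the training data more closely, while smaller values give a smoother, larger-margin decision function.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Noisy, non-linear toy data for an RBF-kernel SVM.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.15, random_state=0)

for gamma in (0.1, 1.0, 10.0):
    for C in (0.1, 1.0, 100.0):
        clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)
        print(f"gamma={gamma:>5}, C={C:>5}: training accuracy = {clf.score(X, y):.2f}")
```

High training accuracy for large \(\gamma\) and \(C\) does not by itself imply good generalization; in practice these values are chosen by cross-validation.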
The support vectors are the data points that lie closest to the decision boundaries or hyperplanes. They are the data points that are most difficult to classify. As the name 'support' suggests, the decision boundary or hyperplane is supported on these vectors, and moving or removing them would change the decision boundary. The entire SVM model is thus based on the concept of support vectors.
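As a final sketch (assuming scikit-learn; the dataset is made up for illustration), a fitted model exposes exactly these points, and they are typically only a small fraction of the training set:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("number of training points :", len(X))
print("number of support vectors :", clf.support_vectors_.shape[0])
print("their indices in the data :", clf.support_)
```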