Introduction

A support-vector machine (SVM) is, in essence, a supervised learning model with an associated learning algorithm used for classification and regression problems. Classifying data is an important task in machine learning: given data points, each belonging to one of two classes, the goal is to decide which class a new data point will be in. The new data point is viewed as a p-dimensional vector, and we want to know whether we can separate such points with a (p − 1)-dimensional hyperplane. A classifier that does this is called a linear classifier.

The following are important concepts in SVM [1]:

  • Support Vectors − Data points that are closest to the hyperplane are called support vectors. The separating hyperplane is defined with the help of these data points.
  • Hyperplane − It is the decision plane or boundary that divides a set of objects belonging to different classes.
  • Margin − It is defined as the gap between the two lines through the closest data points of different classes, i.e. the perpendicular distance from the separating line to the support vectors. The bigger the margin, the better.

Working of SVM

From [2] and [3] we see that in this algorithm we plot each data item as a point in n-dimensional space, where n is the number of features and the value of each coordinate is the value of the corresponding feature. We then perform classification by finding the optimal hyperplane that divides the two classes. There are many possible hyperplanes that could separate the two classes of data points; we choose the plane that has the largest margin, i.e. the maximum distance to the data points of both classes. Maximizing the margin provides some reinforcement so that future data points can be classified with more confidence, and hence lowers the generalization error of the classifier. The hyperplanes are decision boundaries that help us classify the data points: data points falling on either side of the hyperplane can be attributed to different classes.
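
As a concrete illustration of this workflow, below is a minimal sketch that fits a linear SVM on a small two-dimensional toy dataset, assuming scikit-learn and NumPy are available; the data points and the value of C are illustrative choices, not taken from the references.

    import numpy as np
    from sklearn.svm import SVC

    # Two linearly separable classes in a 2-dimensional feature space (n = 2 features).
    X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.0],
                  [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])
    y = np.array([-1, -1, -1, 1, 1, 1])

    # A linear kernel with a large C approximates the maximum-margin classifier.
    clf = SVC(kernel="linear", C=1e6)
    clf.fit(X, y)

    w = clf.coef_[0]         # normal vector of the separating hyperplane
    b = -clf.intercept_[0]   # offset, so that the hyperplane is w.x - b = 0
    print("w =", w, " b =", b)
    print("support vectors:\n", clf.support_vectors_)
    print("margin width:", 2.0 / np.linalg.norm(w))
    print("prediction for a new point:", clf.predict([[3.0, 3.0]]))

Only the support vectors influence the learned hyperplane; the remaining training points could be removed without changing w and b.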

Fig.1 - Support vectors

Three hyperplanes H1, H2 and H3 are shown in the figure below: H1 does not separate the classes; H2 does, but only with a small margin; H3 separates them with the maximal margin.

Fig.2 - Hyperplanes

Linear SVM

We are given a training dataset of ${\displaystyle n} $ points of the form

$ {\displaystyle ({\vec {x}}_{1},y_{1}),\ldots ,({\vec {x}}_{n},y_{n}),} $

where the ${\displaystyle y_{i}}$ are either 1 or −1, each indicating the class to which the point ${\displaystyle {\vec {x}}_{i}}$ belongs. Each ${\vec {x}}_{i}$ is a $ {\displaystyle p} $-dimensional real vector. We want to find the "maximum-margin hyperplane" that divides the group of points ${\displaystyle {\vec {x}}_{i}} $ for which $ {\displaystyle y_{i}=1} $ from the group of points for which $ {\displaystyle y_{i}=-1} $, which is defined so that the distance between the hyperplane and the nearest point $ {\displaystyle {\vec {x}}_{i}}$ from either group is maximized.
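
In code, such a training set is simply a matrix whose rows are the vectors x_i together with a label vector whose entries are +1 or −1; the small example below uses values chosen purely for illustration.

    import numpy as np

    # n = 4 training points, each x_i a p = 2 dimensional real vector
    X = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [6.0, 5.0],
                  [7.0, 6.0]])

    # y_i is either +1 or -1 and indicates the class to which x_i belongs
    y = np.array([-1, -1, 1, 1])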

Any hyperplane can be written as the set of points ${\displaystyle {\vec {x}}} $ satisfying

$ {\displaystyle {\vec {w}}\cdot {\vec {x}}-b=0,} $

where ${\displaystyle {\vec {w}}}$ is the normal vector perpendicular to the hyperplane (not necessarily normalized). The parameter ${\displaystyle {\tfrac {b}{\|{\vec {w}}\|}}}$ determines the offset of the hyperplane from the origin along the normal vector ${\displaystyle {\vec {w}}}$.
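
A quick numerical sketch, with arbitrary illustrative values of w and b, shows how this equation is used: the sign of w·x − b tells us on which side of the hyperplane a point x lies, and b/‖w‖ gives the offset of the hyperplane from the origin.

    import numpy as np

    w = np.array([1.0, 1.0])   # illustrative normal vector
    b = 3.0                    # illustrative offset parameter

    # offset of the hyperplane w.x - b = 0 from the origin along w
    print("offset from origin:", b / np.linalg.norm(w))

    # the sign of w.x - b tells us on which side of the hyperplane x lies
    for x in (np.array([0.5, 0.5]), np.array([4.0, 2.0])):
        print(x, "->", np.sign(np.dot(w, x) - b))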

Fig.3 - Maximum margin hyperplane for an SVM [4]

Hard Margin

As explained in [4], we select two parallel hyperplanes that separate the two classes of data such that the distance between them is as large as possible. The region bounded by these two hyperplanes is the "margin", and the maximum-margin hyperplane is the hyperplane that lies halfway between them. With a normalized or standardized dataset, these hyperplanes can be described by the equations

${\displaystyle {\vec {w}}\cdot {\vec {x}}-b=1}$ and ${\displaystyle {\vec {w}}\cdot {\vec {x}}-b=-1} $

The distance between these two hyperplanes is $ {\displaystyle {\tfrac {2}{\|{\vec {w}}\|}}} $, so to maximize the distance between the planes we want to minimize ${\displaystyle \|{\vec {w}}\|}$. The distance is computed using the point-to-plane distance formula. To prevent data points from falling into the margin, we add the following constraint: for each ${\displaystyle i}$, either

${\displaystyle {\vec {w}}\cdot {\vec {x}}_{i}-b\geq 1}$, if ${\displaystyle y_{i}=1}$, or ${\displaystyle {\vec {w}}\cdot {\vec {x}}_{i}-b\leq -1}$, if ${\displaystyle y_{i}=-1}$.

These constraints state that each data point must lie on the correct side of the margin. This can be rewritten as ${\displaystyle y_{i}({\vec {w}}\cdot {\vec {x}}_{i}-b)\geq 1,\quad {\text{ for all }}1\leq i\leq n.}$
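
These constraints are easy to verify mechanically. The sketch below uses the toy data from earlier with hand-picked values of w and b that happen to satisfy the hard-margin constraints (they are illustrative and not claimed to be the optimal maximum-margin solution) and reports the resulting margin width 2/‖w‖.

    import numpy as np

    # Toy data from before; labels are +1 or -1.
    X = np.array([[1.0, 2.0], [2.0, 1.0], [6.0, 5.0], [7.0, 6.0]])
    y = np.array([-1, -1, 1, 1])

    # Hand-picked (w, b) that satisfy the hard-margin constraints for this data;
    # illustrative only, not claimed to be the maximum-margin solution.
    w = np.array([0.25, 0.25])
    b = 1.75

    # y_i (w . x_i - b) >= 1 must hold for every training point
    values = y * (X @ w - b)
    print("constraint values:", values)
    print("all constraints satisfied:", bool(np.all(values >= 1.0)))
    print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))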

Soft Margin

To extend SVM to cases in which the data are not linearly separable, we use the hinge loss function.

$ {\displaystyle \max \left(0,1-y_{i}({\vec {w}}\cdot {\vec {x}}_{i}-b)\right).} $

Note that ${\displaystyle y_{i}}$ is the i-th target and ${\displaystyle {\vec {w}}\cdot {\vec {x}}_{i}-b}$ is the i-th output. This function is zero if the constraint described above is satisfied, in other words, if ${\displaystyle {\vec {x}}_{i}}$ lies on the correct side of the margin. For data on the wrong side of the margin, the function's value is proportional to the distance from the margin. The goal of the optimization then is to minimize

${\displaystyle \left[{\frac {1}{n}}\sum _{i=1}^{n}\max \left(0,1-y_{i}({\vec {w}}\cdot {\vec {x}}_{i}-b)\right)\right]+ C \lVert {\vec {w}}\rVert ^{2},} $

where the parameter C determines the trade-off between increasing the margin size and ensuring that the ${\displaystyle {\vec {x}}_{i}}$ lie on the correct side of the margin. Thus, for sufficiently small values of C, the second term in the loss function becomes negligible, and the classifier behaves similarly to the hard-margin SVM above if the input data are linearly separable.
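
A minimal sketch of minimizing this soft-margin objective with plain sub-gradient descent is given below; the learning rate, iteration count, and value of C are illustrative choices, and the toy data are the same as before.

    import numpy as np

    def soft_margin_objective(w, b, X, y, C):
        # (1/n) * sum of hinge losses  +  C * ||w||^2
        hinge = np.maximum(0.0, 1.0 - y * (X @ w - b))
        return hinge.mean() + C * np.dot(w, w)

    def train_soft_margin_svm(X, y, C=0.01, lr=0.01, iters=5000):
        # Sub-gradient descent on the soft-margin objective
        # (hyper-parameters are illustrative choices).
        n, p = X.shape
        w, b = np.zeros(p), 0.0
        for _ in range(iters):
            margins = y * (X @ w - b)
            active = margins < 1.0  # points inside the margin or on the wrong side
            grad_w = -(y[active][:, None] * X[active]).sum(axis=0) / n + 2.0 * C * w
            grad_b = y[active].sum() / n   # d/db of -y_i (w.x_i - b) is +y_i
            w -= lr * grad_w
            b -= lr * grad_b
        return w, b

    X = np.array([[1.0, 2.0], [2.0, 1.0], [6.0, 5.0], [7.0, 6.0]])
    y = np.array([-1.0, -1.0, 1.0, 1.0])
    w, b = train_soft_margin_svm(X, y)
    print("w =", w, " b =", b)
    print("objective  :", soft_margin_objective(w, b, X, y, C=0.01))
    print("predictions:", np.sign(X @ w - b))

Note that this follows the convention used here, where C multiplies the regularization term; many library implementations instead place C on the hinge-loss term, so the effect of increasing C is reversed there.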
