Perceptron
Raj Abhishek
The perceptron takes its roots from the biological neuron. In a neuron, the dendrites receive electrical signals from the axons of other neurons. These electrical signals can be modelled mathematically as numerical values. At the synapses between the dendrites and the axons, the electrical signals are modulated by various amounts; mathematically, this corresponds to multiplying each input value by a "weight". An actual neuron fires an output signal only when the total strength of the input signals exceeds a certain threshold. We simulate this by computing the weighted sum of the inputs and passing it through an activation function, which, much like the neuron, produces the desired output only when this weighted sum exceeds a certain threshold.
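As a minimal sketch of this idea in code (the weights, bias, and inputs below are illustrative placeholders, not values from any particular dataset), the output is +1 when the weighted sum of the inputs exceeds the threshold of zero, and -1 otherwise:

import numpy as np

# A minimal sketch of the perceptron's decision rule: a weighted sum of the
# inputs plus a bias, thresholded at zero. All values are made up.
def perceptron_output(inputs, weights, bias):
    weighted_sum = np.dot(weights, inputs) + bias   # analogue of the total signal strength
    return 1 if weighted_sum > 0 else -1            # "fire" only past the threshold

print(perceptron_output(np.array([2.0, -1.0]), np.array([0.5, 0.3]), 0.1))  # prints 1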
Algorithm
We assume the following conditions for the algorithm:
- The data is linearly separable.
- The labels are either +1 or -1.
So, we have two sets of points that are linearly separable, i.e. there exists a hyperplane that perfectly separates the two sets. In such a case, we may write the equation of the hyperplane as:
$$ H(x_1, x_2, x_3, ..., x_n) := w_1x_1 + w_2x_2 + w_3x_3 + ... + w_nx_n + b = 0$$
Mathematically speaking, $w_1,w_2, w_3, ..., w_n$ are coefficients of the variables $x_1, x_2, x_3, ..., x_n$, and $b$ is the intercept of the hyperplane. In the context of the perceptron, however, $w_1,w_2, w_3, ..., w_n$ are called the weights and $b$ the bias.
This equation can be written in a more compact form as:
$$ H(x_1, x_2, x_3, ..., x_n) := \{w_1, w_2, w_3, ..., w_n, b\} . \{x_1, x_2, x_3, ..., x_n, 1\} = \bar{w}.\bar{x} = 0$$
where the "." (dot) represents a dot product, with
$$ \bar{w} = \{w_1, w_2, w_3, ..., w_n, b\}$$
$$ \bar{x} = \{x_1, x_2, x_3, ..., x_n, 1\}$$
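As a quick illustration (the numbers here are arbitrary, chosen only for this example), the augmented vectors let us evaluate $H$ with a single dot product in NumPy:

import numpy as np

w_bar = np.array([2.0, -1.0, 0.5])   # weights w1, w2 followed by the bias b
x_bar = np.array([1.0, 3.0, 1.0])    # coordinates x1, x2 followed by the constant 1
H = np.dot(w_bar, x_bar)             # equals w1*x1 + w2*x2 + b
print(H)                             # -0.5, so this point lies on the negative side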
If a point $x_i = \{x_{i1}, x_{i2}, ..., x_{in}\}$ lies above the hyperplane (with label $y_i = +1$), we have:
$$ H(x_{i1}, x_{i2}, x_{i3}, ..., x_{in}) := \{w_1, w_2, w_3, ..., w_n, b\} . \{x_{i1}, x_{i2}, x_{i3}, ..., x_{in}, 1\} = \bar{w}.\bar{x_i} > 0$$
If however, it lies below the hyperplane (with label $y_i = -1$):
$$ H(x_{i1}, x_{i2}, x_{i3}, ..., x_{in}) := \{w_1, w_2, w_3, ..., w_n, b\} . \{x_{i1}, x_{i2}, x_{i3}, ..., x_{in}, 1\} = \bar{w}.\bar{x_i} < 0$$
Thus, we may condense both these inequalities into a single inequality as:
$$ y_i(\bar{w}.\bar{x_i}) > 0$$
So, the classification of a particular datapoint is correct if it satisfies this inequality. If, however, a datapoint violates this inequality, we update the weights as per the following equation:
$$\bar{w} = \bar{w} + y_i\bar{x_i}$$
Note: If the label convention is reversed (i.e. the signs of the labels are swapped from those used here), the learned weights simply flip sign as well, so that the inequality above is still satisfied.
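In code, the misclassification check and the corresponding update for a single datapoint look as follows (a small sketch with made-up values; the full algorithm, described next, simply repeats this over the whole dataset until no point is misclassified):

import numpy as np

w_bar = np.zeros(3)                       # weights and bias, initialized to zero
x_bar = np.array([2.0, 1.0, 1.0])         # an augmented data point (x1, x2, 1)
y = 1                                     # its label

if y * np.dot(w_bar, x_bar) <= 0:         # misclassified (or on the hyperplane)
    w_bar = w_bar + y * x_bar             # update rule: w <- w + y*x
print(w_bar)                              # [2. 1. 1.]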
Pseudo-code
- Initialize the weights (including the bias), $\bar{w} = \{0, 0, 0, ..., 0\}$
- While TRUE:
-       Initialize the misclassification count, $m = 0$
-       For $\bar{x}, y$ in (inputs, labels):
-             Check for misclassification: if $y(\bar{w}.\bar{x}) \leq 0$:
-                   Update the weights, $\bar{w} = \bar{w} + y\bar{x}$
-                   Update the misclassification count, $m = m + 1$
-       If there are no more misclassifications, i.e. $m = 0$:
-             Break the loop
Implementation
We now implement the algorithm on the following dataset. Plotting the dataset directly, we can see by inspection that it is indeed linearly separable.
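The dataset itself is not reproduced here. As a stand-in so that the code below can be run end to end, the following snippet (an assumption of ours, not the original data) generates a synthetic, linearly separable dataset with the variable names x_data, y_data, and label that the code expects; with different data the learned weights will of course differ from those reported further down.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 50 points of one class followed by 50 of the other,
# separable by a vertical line near x = 50.
x_data = np.concatenate([rng.uniform(1, 45, 50), rng.uniform(55, 100, 50)])
y_data = rng.uniform(1, 100, 100)
label = np.array([1] * 50 + [-1] * 50)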
We make use of the following Python code, based on the Algorithm mentioned earlier:
import numpy as np

num_inputs = 2
weights = np.zeros(num_inputs + 1)          # weights w1, w2 and the bias b

while True:
    m = 0                                   # misclassification count for this pass
    for x, y in zip(np.c_[x_data, y_data], label):
        x = np.append(x, 1)                 # augment the point with a constant 1 for the bias
        if y * np.dot(weights, x) <= 0:     # misclassified (or on the hyperplane)
            weights += y * x                # update rule: w <- w + y*x
            m += 1
    if m == 0:                              # a full pass with no misclassifications
        break
We get the weights as:
[In: ] weights
[Out: ] array([470., -65., 5.])
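The learned weights can then be used to classify a new point by the sign of $\bar{w}.\bar{x}$. A minimal sketch (the point $(10, 40)$ is made up): with the weights shown above, $470 \times 10 - 65 \times 40 + 5 = 2105 > 0$, so the prediction is $+1$; with different data the learned weights, and hence the prediction, will differ.

new_point = np.array([10.0, 40.0, 1.0])                       # augmented with the constant 1
prediction = 1 if np.dot(weights, new_point) > 0 else -1      # sign of w.x decides the class
print(prediction)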
We may then plot it using matplotlib:
import matplotlib.pyplot as plt

# Plot the learned separating line, w0*x + w1*y + w2 = 0, i.e. y = -(w0*x + w2)/w1,
# together with the two classes of points.
xs = np.linspace(1, 100, 100)
plt.plot(xs, -((weights[0] * xs) + weights[2]) / weights[1])
plt.scatter(x_data[:50], y_data[:50], c='g')   # first class
plt.scatter(x_data[50:], y_data[50:], c='r')   # second class
plt.xlabel("x")
plt.ylabel("y")
We thus see that the line indeed divides the dataset into its two classes. The equation of the line, in the form shown earlier, is given by the weights:
$$ H(x,y) := 470 x - 65 y + 5 = 0$$
Convergence
We shall now prove that the perceptron algorithm converges in a finite number of updates. Prior to the proof, we make two simple assumptions, which must hold for the convergence guarantee to apply:
- Linear Separability: There exists some $\textbf{w}^{*}\in\mathbb{R}^d$ such that $||\textbf{w}^{*}|| = 1$ and for some $\gamma > 0$, for all $ i \in \{1, 2, 3, ..., n\}$:
$$y^i(\textbf{w}^{*}.\textbf{x}^{i}) > \gamma$$
- Bounded Coordinates: There exists $R \in \mathbb{R}$ such that for $ i \in \{1, 2, 3, ..., n\}$,
$$||\textbf{x}^i||\leq R$$
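For intuition, here is a small made-up example (unrelated to the dataset used above). Take two one-dimensional points, $\textbf{x}^1 = (3)$ with $y^1 = +1$ and $\textbf{x}^2 = (-2)$ with $y^2 = -1$, and choose $\textbf{w}^* = (1)$, so that $||\textbf{w}^*|| = 1$. Then
$$y^1(\textbf{w}^{*}.\textbf{x}^{1}) = 3, \hspace{1cm} y^2(\textbf{w}^{*}.\textbf{x}^{2}) = 2,$$
so Assumption 1 holds with, say, $\gamma = 1$, and Assumption 2 holds with $R = 3$. The theorem below then guarantees at most $\frac{R^2}{\gamma^2} = 9$ updates on this data.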
Under these two assumptions, we can now state and prove the Perceptron Convergence Theorem:
Perceptron Convergence Theorem: The Perceptron Learning Algorithm makes at most $\frac{R^2}{\gamma^2}$ updates (after which it returns a separating hyperplane).
Proof:
As the weights are all initialized to 0, at the first step, i.e. $k = 1$, we have:
$$\textbf{w}^1=0$$
For some $k \geq 1$, let $\textbf{x}^j$ be the point misclassified at the $k$-th update. The update rule then gives:
$$\textbf{w}^{k+1}.\textbf{w}^* = (\textbf{w}^k + y^j \textbf{x}^j).\textbf{w}^*$$
$$\implies\textbf{w}^{k+1}.\textbf{w}^* = \textbf{w}^k.\textbf{w}^* + y^j \textbf{x}^j.\textbf{w}^*$$
$$\implies\textbf{w}^{k+1}.\textbf{w}^* > \textbf{w}^k.\textbf{w}^* + \gamma$$
where the last line follows from Assumption 1, since $y^j(\textbf{x}^j.\textbf{w}^*) > \gamma$.
Thus, by induction:
$$\textbf{w}^{k+1}.\textbf{w}^* > \textbf{w}^k.\textbf{w}^* + \gamma$$
$$\hspace{2.5cm} > \textbf{w}^{k-1}.\textbf{w}^* + 2\gamma$$
$$\hspace{2.5cm} > \textbf{w}^{k-2}.\textbf{w}^* + 3\gamma$$
$$\hspace{2.5cm} .....$$
$$\hspace{2.5cm} .....$$
$$\hspace{3.25cm} > \textbf{w}^{1}.\textbf{w}^* + k\gamma = k\gamma$$
On the other hand, by the Cauchy-Schwarz inequality and the fact that $||\textbf{w}^*|| = 1$, we have:
$$\textbf{w}^{k+1}.\textbf{w}^* \leq ||\textbf{w}^{k+1}||\text{ }||\textbf{w}^*|| = ||\textbf{w}^{k+1}||$$
And,
$$||\textbf{w}^{k+1}||^2 = ||\textbf{w}^k + y^j \textbf{x}^j||^2$$
$$\hspace{5cm} = ||\textbf{w}^k||^2 + ||y^j \textbf{x}^j||^2 + 2y^j (\textbf{x}^j.\textbf{w}^k)$$
$$\hspace{4.6cm} = ||\textbf{w}^k||^2 + ||\textbf{x}^j||^2 + 2y^j (\textbf{x}^j.\textbf{w}^k)$$
$$\hspace{2.0cm} \leq ||\textbf{w}^k||^2 + ||\textbf{x}^j||^2$$
$$\hspace{1.5cm} \leq ||\textbf{w}^k||^2 + R^2$$
where the third step uses the fact that the label $y^j$ is $+1$ or $-1$ (so $||y^j \textbf{x}^j||^2 = ||\textbf{x}^j||^2$), the fourth step holds because $\textbf{x}^j$ was misclassified, so $y^j(\textbf{x}^j.\textbf{w}^k) \leq 0$, and the last step follows from Assumption 2.
Thus, by induction:
$$||\textbf{w}^{k+1}||^2 \leq ||\textbf{w}^k||^2 + R^2$$
$$\hspace{2.5cm} \leq ||\textbf{w}^{k-1}||^2 + 2R^2$$
$$\hspace{2.5cm} \leq ||\textbf{w}^{k-2}||^2 + 3R^2$$
$$\hspace{2.5cm} .....$$
$$\hspace{2.5cm} .....$$
$$\hspace{3.25cm} \leq ||\textbf{w}^{1}||^2 + kR^2 = kR^2$$
Summarizing, we have:
$$||\textbf{w}^{k+1}|| > k\gamma$$
$$||\textbf{w}^{k+1}||^2 \leq kR^2$$
Thus,
$$k^2\gamma^2 < ||\textbf{w}^{k+1}||^2 \leq kR^2$$
$$\implies k^2\gamma^2 < kR^2$$
$$\implies k < \frac{R^2}{\gamma^2}$$
So, provided our data satisfies the above assumptions, the algorithm converges after at most $\frac{R^2}{\gamma^2}$ updates.
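As an empirical sanity check (a sketch that assumes the x_data, y_data, and label arrays from the implementation above), one can count the number of updates made during training and compare it against the bound, using the normalized final weights as one valid choice of $\textbf{w}^*$; the resulting $\gamma$ is generally not the largest possible, so the bound computed this way is loose, but the update count must still stay below it.

import numpy as np

# Empirical check of the convergence bound; assumes x_data, y_data and label
# are defined as in the implementation above.
X = np.c_[x_data, y_data, np.ones(len(x_data))]   # augmented data points
y_arr = np.asarray(label)

# Re-run the perceptron, this time counting every single update k.
w = np.zeros(X.shape[1])
k = 0
while True:
    m = 0
    for x, y in zip(X, y_arr):
        if y * np.dot(w, x) <= 0:
            w += y * x
            m += 1
            k += 1
    if m == 0:
        break

R = np.max(np.linalg.norm(X, axis=1))       # Assumption 2: bound on the point norms
w_unit = w / np.linalg.norm(w)              # one valid (not optimal) choice of w*
gamma = np.min(y_arr * (X @ w_unit))        # its margin on the data
print(k, R**2 / gamma**2)                   # k should never exceed the bound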
Code with Epochs
The code written above runs until convergence is attained; if the data is not linearly separable, it will loop forever. However, one may want to impose a constraint on the number of passes rather than require perfect convergence. In that case, one may simply change the "while" loop in the code to a "for" loop with the desired number of epochs.
import numpy as np

num_inputs = 2
max_epochs = 1000                           # cap on the number of passes over the data
weights = np.zeros(num_inputs + 1)          # weights w1, w2 and the bias b

for _ in range(max_epochs):
    m = 0                                   # misclassification count for this pass
    for x, y in zip(np.c_[x_data, y_data], label):
        x = np.append(x, 1)                 # augment the point with a constant 1 for the bias
        if y * np.dot(weights, x) <= 0:     # misclassified (or on the hyperplane)
            weights += y * x                # update rule: w <- w + y*x
            m += 1
    if m == 0:                              # converged before hitting max_epochs
        break
Note that the code still contains the "break" statement, so the loop exits early if perfect convergence is attained before "max_epochs" is reached.
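If one also wants to know whether the loop exited because convergence was reached or simply because it ran out of epochs, a small variant (a sketch, reusing the variables defined above) records this in a flag:

converged = False
weights = np.zeros(num_inputs + 1)
for _ in range(max_epochs):
    m = 0
    for x, y in zip(np.c_[x_data, y_data], label):
        x = np.append(x, 1)
        if y * np.dot(weights, x) <= 0:
            weights += y * x
            m += 1
    if m == 0:
        converged = True                    # a full pass with no misclassifications
        break
print("converged" if converged else "reached max_epochs without converging")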