From a dividing line to Support Vector Machines in Python

A simple dividing line can be used to make predictions on simple 2D data, where we have one independent variable and one dependent variable.

A line can be drawn according to the relation between the two, and by following the trend of the line we can make predictions.

For example:

Here we have taken a dummy independent variable X and a dependent variable Y that takes the squared values of X.

import matplotlib.pyplot as plt

X = [1, 3, 5, 7, 8]
Y = [1, 9, 25, 49, 64]  # Y holds the squares of X
plt.scatter(X, Y)
plt.plot(X, Y)
plt.show()

Plotting this, we get the dividing line shown below.

[Plot: dividing line through the points]
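Given such a line, a prediction for an unseen X can be read off by interpolating between the known points. A minimal sketch using NumPy (np.interp is not part of the original snippet):

import numpy as np

X = [1, 3, 5, 7, 8]
Y = [1, 9, 25, 49, 64]

# Linearly interpolate along the plotted line to "predict" Y at X = 6
print(np.interp(6, X, Y))  # 37.0, close to the true 6**2 = 36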

This is obviously a brute-force method. Also, most of the data we encounter in daily life is not a simple pair of numbers; it is multidimensional.

So we need better methods to make predictions. Two of them are linear regression and logistic regression.

These algorithms learn a function, also called a hypothesis, and use that function to make predictions.

Linear regression assumes a linear relationship between the dependent and the independent variable, while logistic regression passes a linear score through a sigmoid function to make predictions.
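As a rough sketch of the two hypothesis functions (illustrative only, not the training procedures):

import numpy as np

def linear_hypothesis(x, w, b):
    # Linear regression hypothesis: h(x) = w*x + b
    return w * x + b

def logistic_hypothesis(x, w, b):
    # Logistic regression squashes the linear score through a sigmoid,
    # giving a value in (0, 1) that can be read as a class probability
    return 1.0 / (1.0 + np.exp(-(w * x + b)))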

Support Vector Machine

A better way to classify data is by using Support Vector Machines (SVMs), a very powerful classification method.

SVM works well with multidimensional data. It generates a hyperplane that distinguishes between different categories.

Here we use a hyperplane because multidimensional data cannot be distinguished using a simple line; we need a higher-dimensional plane to do that.
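Concretely, a hyperplane is the set of points x satisfying w · x + b = 0 for a weight vector w and bias b; the two classes lie on the sides where w · x + b is positive or negative.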

There are certain parameters that give us a better understanding of SVMs:

1) REGULARISATION FACTOR (C): tells us how much misclassification we can tolerate.

2) GAMMA: tells us how far the influence of a single point reaches in the distance calculations. Low values mean that far-away points also get a say in the calculation, and vice versa.

3) KERNEL: the hyperplane is learned using some algebraic equation, for example the equation of a line as in linear regression, or the sigmoid function as in logistic regression. In SVMs we can choose this governing equation by picking a kernel, such as the linear kernel or the Gaussian (RBF) kernel; many more are available. The sketch below shows how these three knobs map onto scikit-learn's SVC.
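For example (illustrative values only; C, gamma, and kernel are the actual SVC argument names):

from sklearn import svm

# C is the regularisation factor, gamma controls the reach of each point,
# and kernel picks the governing equation ('rbf' is the Gaussian kernel)
clf = svm.SVC(C=1.0, gamma='scale', kernel='rbf')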

Now we will train a simple SVM model on the iris dataset available in sklearn, to get a better understanding.

First, import all the necessary modules:

from sklearn import svm, datasets
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

Now let’s load the data and train the model.

iris=datasets.load_iris()
X=iris.data
Y=iris.target
# Hold out a test split (train_test_split defaults to a 25% test set)
X_train,X_test,Y_train,Y_test=train_test_split(X,Y)
# Get an object of the model
clf=svm.SVC()
clf.fit(X_train,Y_train)

As you can see, the iris dataset is multidimensional; each sample has four features:

X
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       ...

Now let’s get the predictions

predict=clf.predict(X_test)
predict


OUTPUT
array([1, 2, 1, 1, 0, 0, 1, 0, 2, 1, 2, 2, 0, 0, 0, 0, 2, 2, 0, 0, 2, 1,
       1, 2, 2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 2, 2, 2])

Congratulations, your SVM model is ready!
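To sanity-check the model, we can also score it on the held-out split (clf.score reports mean accuracy in scikit-learn):

# Mean accuracy on the held-out test set
print(clf.score(X_test, Y_test))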

In the section below, we will classify a dataset using a linear classifier (logistic regression) and a non-linear classifier (a support vector machine). We will generate our own dataset by drawing points uniformly at random, to avoid any unintended pattern in the generated points.

Creating the dataset

import numpy as np
import matplotlib.pyplot as plt

# Ten random points drawn uniformly from the unit square
data = [[np.random.rand(), np.random.rand()] for i in range(10)]

# Label each point by which side of the line y = 0.5x + 0.25 it falls on
for i, point in enumerate(data):
  x, y = point
  if 0.5*x - y + 0.25 > 0:
    data[i].append(-1)
  else:
    data[i].append(1)

# Plot the labelled points
for x, y, l in data:
  if l == 1:
    clr = 'red'
  else:
    clr = 'blue'
  plt.scatter(x, y, c=clr)
plt.xlim(0,1)
plt.ylim(0,1)

Output:

[Plot: the generated points, coloured by class]

Applying logistic regression

from sklearn.linear_model import LogisticRegression

data = np.asarray(data)
X = data[:,:2]
Y = data[:,2]
print(X[2,0])   # quick sanity check on the array layout

clf = LogisticRegression().fit(X,Y)

# Predictions on the training points themselves
pred_ytrain=clf.predict(X)
print(pred_ytrain)

X1space = np.linspace(0, 1, 100)
X2space = np.linspace(0, 1, 100)

# Creating the array of testing points
Xtest = []
for i in range(100):
    for j in range(100):
        Xtest.append([X1space[i],X2space[j]])
Xtest = np.asarray(Xtest)

# Classify every grid point with the trained logistic model
pred_grid = clf.predict(Xtest)
plt.xlim(0,1)
plt.ylim(0,1)
for i in range(len(pred_grid)):
    if pred_grid[i] == 1:
        clr = 'red'
    else:
        clr = 'blue'
    plt.scatter(Xtest[i,0], Xtest[i,1], c=clr)

Output:

[Plot: logistic regression decision regions]
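As an aside, plotting 10,000 individual scatter points is slow; the same region plot can be drawn in one call with matplotlib's contourf (a sketch, assuming clf is the fitted LogisticRegression from above):

# Vectorized decision-region plot over the unit square
xx, yy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3, cmap='bwr')
plt.show()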

Visualizing original and logistic boundary

The learned boundary is the line where the logistic score is zero, i.e. clf.coef_[0][0]*x + clf.coef_[0][1]*y + clf.intercept_[0] = 0; solving for y gives the predicted line.

x_points = np.linspace(0,1,1000)

# Defining and plotting the predicted line 
y_pred = -(clf.coef_[0][0]*x_points + clf.intercept_[0])/clf.coef_[0][1]
plt.xlim(0,1)
plt.ylim(0,1)
plt.plot(x_points, y_pred,'-r', label = 'Pred line')


# Defining and plotting the actual line
y_act = (0.5*x_points + 0.25)
plt.plot(x_points, y_act, '-b', label = 'Act line')
plt.legend()

Output:

[Plot: predicted logistic boundary vs. actual boundary]

SVM function from scratch

def svm_function(x, y, epoch, l_rate):

    # Appending a fixed bias of +1 to all the datapoints
    list_X = x.tolist()
    for i in range(len(x)):
      list_X[i].append(1)
    x = np.asarray(list_X)
    
    # Initializing the weight vector with zeros (one extra entry for the bias)
    w = np.zeros(len(x[0]))
    print(w.shape,x[1].shape)   # sanity check on dimensions

    # Updating the weight vector 
    for val in range(1,epoch):
        for i, point in enumerate(x):
            if (y[i]*np.dot(x[i], w)) < 1:
                w = w + l_rate * ((x[i]*y[i]) + (-2*(1/epoch)* w))
            else:
                w = w + l_rate * (-2*(1/epoch)* w)
    return w
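What the update rule implements: this is (sub)gradient descent on the soft-margin hinge loss. When y[i]*(w · x[i]) < 1 the point violates the margin, so the weights move towards x[i]*y[i]; in either case the -2*(1/epoch)*w term acts as weight decay, with 1/epoch playing the role of the regularisation strength.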

Training on the dataset

data = np.asarray(data)
X = data[:,:2]
Y = data[:,2]
w = svm_function(X, Y, 10000, 1)

# Plotting SVM vs the original boundary
x_points = np.linspace(0,1,1000)

# Defining and plotting the predicted line
y_pred = (-w[0]*x_points - w[2])/w[1]
plt.plot(x_points, y_pred,'-r', label = 'Pred line')

# Defining and plotting the actual line 
y_act = (0.5*x_points + 0.25)
plt.plot(x_points, y_act, '-b', label = 'Act line')
plt.legend()

Output:

[Plot: scratch-SVM boundary vs. actual boundary]

Visualizing Decision boundary

X1=X[:,0]
X2=X[:,1]


def mapping(X1, X2):
  # Identity map: the boundary here is linear, so the raw features suffice.
  # For curved boundaries a richer map could be returned instead, e.g. the
  # polynomial features (X1**2, sqrt(2)*X1*X2, X2**2).
  features_x = np.array([X1, X2])
  return features_x

features_x = mapping(X1, X2)


# Calling the previously defined SVM function
w = svm_function(features_x.T, Y, 10000, 1)

The mapping function defined above is of no use here, since we know the boundary is linear. But for non-linear boundaries the mapping function is of utmost importance. To show how to apply it to a non-linear boundary, I have created a dataset; the link is given at the end of the post.

Visualizing the SVM boundary

def vis_dec_bnd():
  # Note: relies on the globals X1, X2, w and the current mapping function
  X1space = np.linspace(np.min(X1), np.max(X1), 100)
  X2space = np.linspace(np.min(X2), np.max(X2), 100)

 
  Xtest = []
  for i in range(100):
    for j in range(100):        
      Xtest.append([X1space[i],X2space[j]])
  Xtest = np.asarray(Xtest)

  
  features_x = mapping(Xtest[:,0], Xtest[:,1]).T

  # Adding a column of all ones to match with the dimension of weight vector  
  features_x = np.c_[features_x,np.ones(len(features_x))]

  # Classifying the points with trained SVM function 
  for i in range(len(features_x)):
    if np.dot(features_x[i], w) > 0:
      plt.scatter(Xtest[i,0], Xtest[i,1], c = "r")
    else:
      plt.scatter(Xtest[i,0], Xtest[i,1], c = "b")
  return None


vis_dec_bnd()

Output:

[Plot: decision regions from the scratch SVM]

Now, using the same functions defined above, we will visualize a non-linear boundary.

Non-linear dataset

import csv
import urllib.request

# open() cannot read a URL directly, so fetch the CSV first
url = 'https://raw.githubusercontent.com/premssr/SVM-nonlinear-dataset/master/Csv1.csv'
with urllib.request.urlopen(url) as f:
  values_csv2 = list(csv.reader(f.read().decode('utf-8').splitlines(), delimiter=','))


for x, y, l in values_csv2:
    if (int(l) == 1): 
      clr = 'red'
    else: 
      clr = 'blue'
    plt.scatter(float(x), float(y), c=clr)

Training and visualization of the non-linear dataset

X = []
Y = []

for row in values_csv2:
  X.append(row[:-1])
  Y.append(row[-1])

X = np.array(X).astype('float64')
Y = np.array(Y).astype('float64')

X1 = X[:,0]
X2 = X[:,1]

def mapping(X1, X2):
  x = np.c_[(X1, X2)]
  x_1 = x[:,0]
  # Elliptical feature: constant along the ellipses x1^2/0.4 + x2^2/0.8 = c
  x_2 = pow(x[:,0], 2)/0.4 + pow(x[:,1], 2)/0.8
  x_3 = x[:,1]
  features_x = np.array([X1, X2, x_1, x_2, x_3])
  return features_x

features_x = mapping(X1, X2)
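A brief note on why this map helps: along any ellipse x1^2/0.4 + x2^2/0.8 = c the engineered feature x_2 is constant, so a boundary that is elliptical in the original plane becomes a simple linear threshold on x_2 in the lifted feature space, which is exactly the kind of boundary the linear svm_function above can learn.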

 
w = svm_function(features_x.T, Y, 10000, 1) 
print(w)


vis_dec_bnd()

Here is the link to the complete code and dataset: https://github.com/premssr/SVM-nonlinear-dataset

That’s it! We have applied two different algorithms to the same dataset and observed how efficient an SVM is at classifying a non-linear dataset. Please do comment if you come up with a more efficient way to visualize the power of SVMs.
