Predict Heart Disease Using SVM in Python

In this tutorial, we will predict heart disease by training a machine learning model (a Support Vector Machine) on a Kaggle dataset in Python.

We aim to classify heartbeats extracted from an ECG using machine learning, based only on the line shape (morphology) of the individual heartbeats. To achieve this, we will have to import various modules in Python. We will be using Visual Studio Code for execution. In this dataset, the single heartbeats were extracted from the ECG using the Pan-Tompkins algorithm.

There are two dataset files: one contains the signals from the ECG and the other contains the type of heart disease. They can be downloaded from these two links: Signals and DS1_labels.

These labels represent the heartbeat types:

  • 0 = Normal
  • 1 = Supraventricular ectopic beat
  • 2 = Ventricular ectopic beat
  • 3 = Fusion Beat

Install the modules given below by using “pip install (module name)”
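For example, everything used in this tutorial can be installed in one go (assuming pip is available on your PATH):

pip install numpy matplotlib scikit-learn seaborn pandas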

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import LinearSVC, SVC
import seaborn as sn
import pandas as pd

Read the datasets from your system using read_csv() and mention the location of the dataset files.

signals = pd.read_csv("C:\\Users\\monis\\Downloads\\DS1_signals.csv", header=None)
labels = pd.read_csv("C:\\Users\\monis\\Downloads\\DS1_labels.csv", header=None)

Dataset details:

print("*"*50)
print("Signals Info:")
print("*"*50)
print(signals.info())
print("*"*50)
print("Labels Info:")
print("*"*50)
print(labels.info())
print("*"*50)
signals.head()

dataset_name.info() gives a basic description of the dataset, such as the number of rows and columns, the type of entries, and the memory usage. The head() function gives the first 5 rows of the dataset.

Output:

**************************************************
Signals Info:
**************************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51002 entries, 0 to 51001
Columns: 180 entries, 0 to 179
dtypes: float64(180)
memory usage: 70.0 MB
None
**************************************************
Labels Info:
**************************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51002 entries, 0 to 51001
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 51002 non-null int64
dtypes: int64(1)
memory usage: 398.5 KB
None
**************************************************
0 1 2 3 4 5 6 7 8 9 ... 170 171 172 173 174 175 176 177 178 179
0 0.96582 0.96777 0.96729 0.96826 0.96973 0.96680 0.96533 0.96729 0.96875 0.97021 ... 0.97070 0.97314 0.97510 0.97656 0.97510 0.97607 0.97705 0.97852 0.97949 0.97949
1 0.97412 0.97314 0.97363 0.97314 0.97314 0.97314 0.97461 0.97412 0.97314 0.97217 ... 0.97070 0.97168 0.97119 0.97266 0.97510 0.97705 0.97607 0.97607 0.97705 0.97803
2 0.96240 0.96289 0.96484 0.96631 0.96631 0.96436 0.96338 0.96240 0.96533 0.96582 ... 0.95996 0.96094 0.96143 0.95996 0.96094 0.96289 0.96533 0.96533 0.96338 0.96533
3 0.95898 0.95996 0.96094 0.96045 0.95898 0.95898 0.95801 0.95947 0.96094 0.95996 ... 0.96338 0.96289 0.96387 0.96387 0.96289 0.96387 0.96533 0.96631 0.96533 0.96631
4 0.96973 0.97070 0.96875 0.96875 0.96777 0.96826 0.96973 0.96875 0.96924 0.96924 ... 0.95166 0.95264 0.95410 0.95605 0.95703 0.95703 0.95605 0.95459 0.95557 0.95654

Data Analysis and Data Preprocessing

Now we will check for missing data in the dataset.

print("Column Number of NaN's")
for col in signals.columns:
    if signals[col].isnull().sum() > 0:
        print(col, signals[col].isnull().sum())

isnull() returns True for any null or empty values in the selected column; summing the result gives the count of missing entries in that column.

Output:

Column Number of NaN's

This means that our dataset doesn’t contain any null values. If there were any, the loop would print the column number along with its count of null values.
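To see how the check behaves when values are missing, here is a tiny illustration on a hypothetical frame (not part of the ECG dataset):

demo = pd.DataFrame({'a': [1.0, None, 3.0], 'b': [4.0, 5.0, 6.0]})
print(demo.isnull().sum())  # column 'a' has 1 missing value, column 'b' has 0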

In our dataset, the responses and the signals (variables or predictors) are in two different files, so we have to combine them.

joined_data = signals.join(labels, rsuffix="_signals", lsuffix="_labels")
joined_data.columns = [i for i in range(180)]+['class']

The first line uses join() to combine the labels with the signals. The second line renames the columns, labeling the response column of the joined data 'class'.
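As a small aside, here is how join() handles overlapping column names with the suffixes, shown on two hypothetical one-column frames (not the ECG data):

left = pd.DataFrame({0: [1, 2]})
right = pd.DataFrame({0: ['a', 'b']})
print(left.join(right, lsuffix='_labels', rsuffix='_signals'))
# prints columns '0_labels' and '0_signals' side by side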

Now we will find the correlation between the features and plot the four most highly correlated features.

cor_mat=joined_data.corr()
print('*'*50)
print('Top 10 high positively correlated features')
print('*'*50)
print(cor_mat['class'].sort_values(ascending=False).head(10))
print('*'*50)
print('Top 10 high negatively correlated features')
print('*'*50)
print(cor_mat['class'].sort_values().head(10))
%matplotlib inline
from pandas.plotting import scatter_matrix
features = [79,80,78,77]
scatter_matrix(joined_data[features], figsize=(20,15), c =joined_data['class'], alpha=0.5);

joined_data.corr() computes the pairwise correlation between the columns. We then sort the correlations with the 'class' column to find the 10 highest and 10 lowest values; sort_values() is used for the sorting.

Output:

**************************************************
Top 10 high positively correlated features
**************************************************
class 1.000000
79 0.322446
80 0.320138
78 0.318702
77 0.311504
81 0.310178
76 0.302628
82 0.292991
75 0.291687
98 0.285491
Name: class, dtype: float64
**************************************************
Top 10 high negatively correlated features
**************************************************
153 -0.090500
154 -0.090206
152 -0.089958
155 -0.089625
156 -0.089017
157 -0.088890
151 -0.088853
158 -0.088647
150 -0.087771
159 -0.087768
Name: class, dtype: float64

(The scatter matrix of features 77, 78, 79, and 80, colored by class, is shown in the plot produced above.)

From the graph, we can see that the correlation among those features (77, 78, 79, 80) is strongly linear. Next, we will find the proportion of each class to see whether the data is balanced or unbalanced.

print('-'*20)
print('Class\t %')
print('-'*20)
print(joined_data['class'].value_counts()/len(joined_data))
joined_data.hist('class');
print('-'*20)

value_counts() counts the occurrences of each class in the column. We divide that result by the number of rows to get the proportion of each class.

Output:

--------------------
Class %
--------------------
0 0.898475
2 0.074272
1 0.019137
3 0.008117
Name: class, dtype: float64
--------------------

So we see that our data is quite imbalanced: only about 10% of the samples belong to classes 1, 2, and 3, while roughly 90% of the data falls under class 0.
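As a side note, the same proportions can be obtained directly with the normalize flag (a small sketch):

print(joined_data['class'].value_counts(normalize=True))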

Our data has no missing values, so we can move on to building the model.

Resampling

Now we are going to split the dataset: 80% for training and 20% for testing.

from sklearn.model_selection import StratifiedShuffleSplit
split1 = StratifiedShuffleSplit(n_splits=1, test_size=0.2,random_state=42)
for train_index, test_index in split1.split(joined_data, joined_data['class']):
    train_set = joined_data.loc[train_index]
    test_set = joined_data.loc[test_index]

StratifiedShuffleSplit provides train/test indices to split the data into train and test sets while preserving the class proportions; we set test_size=0.2 (20%). The for loop then builds the actual sets: split() generates the train and test indices, which we assign to train_index and test_index, and the rows at those indices become train_set and test_set.
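If you want to confirm that the split really is stratified, a quick check (just a sketch) is to compare the class proportions in both sets; they should be close to the proportions computed above:

print(train_set['class'].value_counts(normalize=True))
print(test_set['class'].value_counts(normalize=True))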

features_train = train_set.drop('class', axis=1)
labels_train = train_set['class']

This produces the feature and label sets for the training stage.

Support Vector Machine

Let us choose our parameters C (soft-margin cost) and gamma, and then tune them to find the best combination. Before doing that, we are going to standardize the data; the purpose of standardizing is to reduce the influence of outliers and leverage points. For that, we use StandardScaler().

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
std_features = scaler.fit_transform(features_train)
svc_param_grid = {'C':[10], 'gamma':[0.1,1,10]}
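As a quick sanity check (a sketch), each standardized feature should now have a mean close to 0 and a standard deviation close to 1:

print(std_features.mean(axis=0)[:5])  # roughly 0 for the first few columns
print(std_features.std(axis=0)[:5])   # roughly 1 for the first few columns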

Initialize the classifier:

svc = SVC(kernel='rbf',decision_function_shape='ovo',random_state=42, max_iter = 500)

The kernel parameter specifies the type of kernel used; we are using RBF. We also set decision_function_shape to 'ovo' (one-vs-one). Now we are going to find the best parameters among the chosen ones.
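For reference, the RBF kernel that SVC applies internally measures the similarity between two samples as K(x, y) = exp(-gamma * ||x - y||^2). The sketch below only illustrates that formula; it is not code the classifier itself calls:

def rbf_kernel_value(x, y, gamma):
    # similarity decays with the squared distance between x and y
    return np.exp(-gamma * np.sum((x - y) ** 2))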

from sklearn.model_selection import GridSearchCV

svc_grid_search = GridSearchCV(svc, svc_param_grid, cv=3, scoring="f1_macro")

We are selecting the best parameters based on the F1 score. The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0. With scoring='f1_macro', the F1 score is computed per class and then averaged, which is useful for our imbalanced classes.
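Here is a tiny illustration of the macro F1 score on hypothetical labels (not from our dataset):

from sklearn.metrics import f1_score
y_true = [0, 0, 1, 2, 2, 3]
y_pred = [0, 1, 1, 2, 2, 3]
print(f1_score(y_true, y_pred, average='macro'))  # per-class F1 scores, averaged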

svc_grid_search.fit(std_features, labels_train)

This runs the grid search on the training set: each parameter combination is tried with 3-fold cross-validation, and the best one is kept.

Output:

GridSearchCV(cv=3, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovo', degree=3,
                           gamma='scale', kernel='rbf', max_iter=500,
                           probability=False, random_state=42, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [10], 'gamma': [0.1, 1, 10]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='f1_macro', verbose=0)
train_accuracy=svc_grid_search.best_score_
print('Model\t\tBest params\t\tBest score')
print("-"*50)
print("SVC\t\t", svc_grid_search.best_params_, train_accuracy)

Output:

Model		Best params		Best score
--------------------------------------------------
SVC		 {'C': 10, 'gamma': 0.1} 0.9104871061578681

Now for the test set:

features_test = test_set.drop('class', axis=1)
labels_test = test_set['class']
std_features = scaler.fit_transform(features_test)
svc_grid_search.fit(std_features, labels_test)
test_accuracy = svc_grid_search.best_score_
print('Model\t\tBest params\t\tBest score')
print("-"*50)
print("SVC\t\t", svc_grid_search.best_params_, test_accuracy)

Output:

Model		Best params		Best score
--------------------------------------------------
SVC		 {'C': 10, 'gamma': 0.1} 0.8343809959585644
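Note that the code above re-fits both the scaler and the grid search on the test data, so the score it reports is a cross-validated score on that split rather than a classic held-out evaluation. A more conventional alternative (just a sketch, reusing the variables defined earlier) would fit everything on the training set only and then predict on the untouched test set:

from sklearn.metrics import f1_score

# fit the scaler and the grid search on the training data only
scaler = StandardScaler().fit(features_train)
svc_grid_search.fit(scaler.transform(features_train), labels_train)

# score the tuned model on the held-out test set
test_predictions = svc_grid_search.best_estimator_.predict(scaler.transform(features_test))
print(f1_score(labels_test, test_predictions, average='macro'))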

Conclusion:

print("Train Accuracy : "+str(train_accuracy))
print("Test Accuracy  : "+str(test_accuracy))

Output:

Train Accuracy : 0.9104871061578681
Test Accuracy  : 0.8343809959585644

Also read: Machine Learning Model to predict Bitcoin Price in Python
