Predict Heart Disease Using SVM in Python
In this tutorial, we will predict heart disease by training a Support Vector Machine (SVM) on a Kaggle dataset in Python.
We aim to classify heartbeats extracted from an ECG using machine learning, based only on the lineshape (morphology) of the individual heartbeats. To achieve this, we will import various Python modules; we will use Visual Studio Code for execution. In this dataset, the individual heartbeats were extracted from the ECG using the Pan-Tompkins algorithm.
The dataset consists of two files: one contains the ECG signals and the other contains the type of heartbeat. They can be downloaded from these two links: Signals and DS1_labels.
The labels represent the heartbeat types:
- 0 = Normal
- 1 = Supraventricular ectopic beat
- 2 = Ventricular ectopic beat
- 3 = Fusion Beat
Install the modules used below with "pip install (module name)", for example pip install scikit-learn.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import LinearSVC, SVC
import seaborn as sn
import pandas as pd
Read the dataset from your system using read_csv, mentioning the location of the files:
signals = pd.read_csv("C:\\Users\\monis\\Downloads\\DS1_signals.csv", header=None)
labels = pd.read_csv("C:\\Users\\monis\\Downloads\\DS1_labels.csv", header=None)
Dataset details:
print("*"*50) print("Signals Info:") print("*"*50) print(signals.info()) print("*"*50) print("Labels Info:") print("*"*50) print(labels.info()) print("*"*50) signals.head()
dataset_name.info() gives a basic description of the dataset: the number of rows and columns, the data types of the entries, and the memory usage. The head() function returns the first 5 rows of the dataset.
Output:
**************************************************
Signals Info:
**************************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51002 entries, 0 to 51001
Columns: 180 entries, 0 to 179
dtypes: float64(180)
memory usage: 70.0 MB
None
**************************************************
Labels Info:
**************************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51002 entries, 0 to 51001
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       51002 non-null  int64
dtypes: int64(1)
memory usage: 398.5 KB
None
**************************************************
         0        1        2        3        4        5        6        7        8        9  ...      170      171      172      173      174      175      176      177      178      179
0  0.96582  0.96777  0.96729  0.96826  0.96973  0.96680  0.96533  0.96729  0.96875  0.97021  ...  0.97070  0.97314  0.97510  0.97656  0.97510  0.97607  0.97705  0.97852  0.97949  0.97949
1  0.97412  0.97314  0.97363  0.97314  0.97314  0.97314  0.97461  0.97412  0.97314  0.97217  ...  0.97070  0.97168  0.97119  0.97266  0.97510  0.97705  0.97607  0.97607  0.97705  0.97803
2  0.96240  0.96289  0.96484  0.96631  0.96631  0.96436  0.96338  0.96240  0.96533  0.96582  ...  0.95996  0.96094  0.96143  0.95996  0.96094  0.96289  0.96533  0.96533  0.96338  0.96533
3  0.95898  0.95996  0.96094  0.96045  0.95898  0.95898  0.95801  0.95947  0.96094  0.95996  ...  0.96338  0.96289  0.96387  0.96387  0.96289  0.96387  0.96533  0.96631  0.96533  0.96631
4  0.96973  0.97070  0.96875  0.96875  0.96777  0.96826  0.96973  0.96875  0.96924  0.96924  ...  0.95166  0.95264  0.95410  0.95605  0.95703  0.95703  0.95605  0.95459  0.95557  0.95654
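The output confirms that the signals file has 51002 rows of 180 samples each and the labels file has 51002 single-column rows. As an optional sanity check (not part of the original script), you can verify that the two files line up:

# Optional sanity check: the number of signal rows should match the number of labels
print(signals.shape)   # expected: (51002, 180)
print(labels.shape)    # expected: (51002, 1)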
Data Analysis and Data Preprocessing
Now we will check for missing data in the dataset
print("Column Number of NaN's") for col in signals.columns: if signals[col].isnull().sum() > 0: print(col, signals[col].isnull().sum())
isnull() returns True for every entry that is null or missing, so summing it per column gives the number of missing values in that column.
Output:
Column Number of NaN's
This means that our dataset doesn't contain any null values. If there were any, the loop would print the column numbers along with their counts of null values.
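A more compact way to reach the same conclusion (a small alternative sketch, not part of the original code) is to count all missing values at once:

# Total number of missing values across each DataFrame; 0 means no missing data
print(signals.isnull().sum().sum())
print(labels.isnull().sum().sum())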
In our dataset, the responses (labels) and the signals (variables or predictors) are in two different files, so we have to combine them.
joined_data = signals.join(labels, rsuffix="_signals", lsuffix="_labels")
joined_data.columns = [i for i in range(180)] + ['class']
The first line uses join() to attach the labels to the signals. The second line renames the columns of the joined data to 0-179 for the signal features and 'class' for the response.
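An equivalent way to combine the two DataFrames (a sketch, not part of the original script) is pd.concat along the column axis:

# Equivalent combination using pd.concat (optional alternative)
joined_data = pd.concat([signals, labels], axis=1)
joined_data.columns = list(range(180)) + ['class']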
Now we will find the correlation between the features and the class, and plot the four most highly correlated features.
cor_mat = joined_data.corr()
print('*'*50)
print('Top 10 high positively correlated features')
print('*'*50)
print(cor_mat['class'].sort_values(ascending=False).head(10))
print('*'*50)
print('Top 10 high negatively correlated features')
print('*'*50)
print(cor_mat['class'].sort_values().head(10))

%matplotlib inline
from pandas.plotting import scatter_matrix
features = [79, 80, 78, 77]
scatter_matrix(joined_data[features], figsize=(20, 15), c=joined_data['class'], alpha=0.5);
joined_data.corr() gives the pairwise correlation between every pair of columns. We then sort the correlations with the 'class' column to find the 10 most positively and the 10 most negatively correlated features.
Output:
**************************************************
Top 10 high positively correlated features
**************************************************
class    1.000000
79       0.322446
80       0.320138
78       0.318702
77       0.311504
81       0.310178
76       0.302628
82       0.292991
75       0.291687
98       0.285491
Name: class, dtype: float64
**************************************************
Top 10 high negatively correlated features
**************************************************
153   -0.090500
154   -0.090206
152   -0.089958
155   -0.089625
156   -0.089017
157   -0.088890
151   -0.088853
158   -0.088647
150   -0.087771
159   -0.087768
Name: class, dtype: float64
The scatter_matrix call above produces a grid of pairwise scatter plots of features 79, 80, 78, and 77, colored by class.
From the plot, we can see that the correlation among those features (79, 80, 78, 77) is strongly linear. Next, we will find the proportion of each class to check whether the data is balanced or unbalanced.
print('-'*20)
print('Class\t %')
print('-'*20)
print(joined_data['class'].value_counts()/len(joined_data))
joined_data.hist('class');
print('-'*20)
value_counts() counts the occurrences of each value in the column. We divide that result by the number of rows so that we get the proportion of each class.
Output:
--------------------
Class    %
--------------------
0    0.898475
2    0.074272
1    0.019137
3    0.008117
Name: class, dtype: float64
--------------------
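The same proportions can be obtained in a single call with the normalize argument (an optional alternative, not part of the original code):

# value_counts(normalize=True) returns class proportions directly
print(joined_data['class'].value_counts(normalize=True))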
So, we see that our data is quite unbalanced: only about 10% of the rows belong to classes 1, 2, and 3, while roughly 90% of the data falls under class 0.
Our data has no missing values, so we can move on to the algorithm.
Resampling
Now we are going to split the dataset: 80% for training and 20% for testing.
from sklearn.model_selection import StratifiedShuffleSplit

split1 = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split1.split(joined_data, joined_data['class']):
    train_set = joined_data.loc[train_index]
    test_set = joined_data.loc[test_index]
StratifiedShuffleSplit provides train/test indices to split the data into train and test sets while preserving the class proportions in each split. We set test_size=0.2 (20%). The split() call generates the train and test indices, which the for loop assigns to train_index and test_index; the corresponding rows are then assigned to train_set and test_set.
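For reference, a similar stratified 80/20 split can be written more compactly with train_test_split (a sketch under the same assumptions, not the code used in this tutorial):

# Stratified 80/20 split in one call (optional alternative)
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(
    joined_data,
    test_size=0.2,
    stratify=joined_data['class'],
    random_state=42,
)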
features_train = train_set.drop('class', axis=1)
labels_train = train_set['class']
This produces the feature and label sets for the training stage.
Support Vector Machine
Let us choose candidate values for the parameters C (the soft-margin cost) and gamma, and then tune them to find the best combination. Before doing so, we are going to standardize the data; the purpose of standardizing is to reduce the influence of outliers and leverage points. For that, we use StandardScaler().
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
std_features = scaler.fit_transform(features_train)
svc_param_grid = {'C': [10], 'gamma': [0.1, 1, 10]}
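For reference, StandardScaler rescales each feature column to zero mean and unit variance. A minimal sketch of the equivalent manual computation (illustration only; it assumes no column is constant):

# Manual z-score standardization, equivalent to what StandardScaler does above
mean = features_train.values.mean(axis=0)
std = features_train.values.std(axis=0)
std_features_manual = (features_train.values - mean) / std   # assumes std != 0 for every column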
Initialize the classifier:
svc = SVC(kernel='rbf', decision_function_shape='ovo', random_state=42, max_iter=500)
kernel is the type of kernel used; we are using RBF. decision_function_shape='ovo' means one-vs-one: with 4 classes, one binary classifier is trained for each pair of classes (6 in total). Now we are going to find the best parameters among the chosen ones.
from sklearn.model_selection import GridSearchCV

svc_grid_search = GridSearchCV(svc, svc_param_grid, cv=3, scoring="f1_macro")
We are selecting the best parameters based on the F1 score. The F1 score can be interpreted as the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0. With scoring='f1_macro', the F1 score is computed per class and then averaged, which is useful for unbalanced data like ours.
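As a reminder, for a single class F1 = 2 * precision * recall / (precision + recall). A tiny illustration with made-up labels (not the tutorial's data):

# Macro-F1 on a toy example: per-class F1 scores are averaged without weighting
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 2, 3]
y_pred = [0, 1, 1, 2, 3]
print(f1_score(y_true, y_pred, average='macro'))   # ≈ 0.83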
svc_grid_search.fit(std_features, labels_train)
This fits the SVC on the training set for every parameter combination in the grid and keeps the best one.
Output:
GridSearchCV(cv=3, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovo', degree=3,
                           gamma='scale', kernel='rbf', max_iter=500,
                           probability=False, random_state=42, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [10], 'gamma': [0.1, 1, 10]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='f1_macro', verbose=0)
train_accuracy = svc_grid_search.best_score_
print('Model\t\tBest params\t\tBest score')
print("-"*50)
print("SVC\t\t", svc_grid_search.best_params_, train_accuracy)
Output:
Model           Best params             Best score
--------------------------------------------------
SVC             {'C': 10, 'gamma': 0.1} 0.9104871061578681
Now for the test set:
features_test = test_set.drop('class', axis=1)
labels_test = test_set['class']
std_features = scaler.fit_transform(features_test)
svc_grid_search.fit(std_features, labels_test)
test_accuracy = svc_grid_search.best_score_
print('Model\t\tBest params\t\tBest score')
print("-"*50)
print("SVC\t\t", svc_grid_search.best_params_, test_accuracy)
Output:
Model           Best params             Best score
--------------------------------------------------
SVC             {'C': 10, 'gamma': 0.1} 0.8343809959585644
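Note that the block above re-runs the grid search on the test set. A more conventional way to evaluate the tuned model on held-out data (a sketch using the same variable names, assuming the test-set fitting block above is skipped so that scaler and svc_grid_search remain fitted on the training data) is:

# Evaluate the model tuned on the training set against the held-out test set
from sklearn.metrics import f1_score

features_test = test_set.drop('class', axis=1)
labels_test = test_set['class']

# Reuse the scaler fitted on the training features; only transform the test features
std_features_test = scaler.transform(features_test)

# best_estimator_ is the SVC refitted on the training data with the best parameters
predictions = svc_grid_search.best_estimator_.predict(std_features_test)
print("Test macro-F1:", f1_score(labels_test, predictions, average='macro'))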
Conclusion:
print("Train Accuracy : "+str(train_accuracy)) print("Test Accuracy : "+str(test_accuracy))
Output:
Train Accuracy : 0.9104871061578681
Test Accuracy : 0.8343809959585644
Also read: Machine Learning Model to predict Bitcoin Price in Python