Data Analysis for Multidimensional Data in Python

In this article, we will walk through the sequence of steps needed to prepare multidimensional data for a machine learning algorithm, with a Python code implementation for each step.

Multidimensional data raises several issues: missing values, collinearity, multicollinearity, categorical attributes, and so on. Let's see how to deal with each of them.

Link to the dataset and code will be provided at the end of the article.

Data Analysis

Import Data

import pandas as pd
sheet=pd.read_csv("https://raw.githubusercontent.com/premssr/Steps-in-Data-analysis-of-Mutidimensional-data/master/Train_before.csv")
sheet.head()

Output:

Understanding Data

sheet.describe(include='all')

Output:

This data has both numerical and categorical predictors. The salary column is the one we need to predict, so we first convert it into a 0/1 variable. This was done as the first step of the analysis, directly in the CSV file itself. The given data also has some missing values.
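Although that binarization was already done in the CSV, a minimal pandas sketch of the same step might look like this (the raw label strings and the file name here are assumptions, not taken from the article's files):

# Hypothetical raw file and labels: map salary strings to 0/1
raw = pd.read_csv("Train_raw.csv")                      # assumed file name
raw['salary'] = (raw['salary'] == '>50K').astype(int)   # assumed raw labels '>50K'/'<=50K'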

Divide the predictors and response

pdytrain=sheet['salary']
pdxtrain=sheet.drop('salary',axis=1)
pdxtrain.head()

Output:

Generally, when we collect data in practice, some values are missing. This might be due to the negligence of the volunteer collecting the data, or to an inefficient design of the experiment. Whatever the reason, we data analysts have to cope with it. There are quite a few methods to handle it. If we have enough data that removing the affected rows won't hurt our model, we can simply drop them. Otherwise, we replace each missing value with an appropriate value such as the mean, median, or mode of the attribute; this is called imputation. Here we will replace missing values with the most frequent value (mode) for discrete attributes and with the mean for continuous attributes.
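For the row-dropping alternative mentioned above, pandas has it built in; a minimal sketch (the 'hours-per-week' column name is assumed to match the raw CSV header, and the second line just illustrates single-column mean imputation):

# Drop every row containing at least one missing value (only if we can afford the loss)
dropped = pdxtrain.dropna()

# ...or fill one continuous column with its mean
pdxtrain['hours-per-week'] = pdxtrain['hours-per-week'].fillna(pdxtrain['hours-per-week'].mean())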

Count the number of missing values in each attribute

pdxtrain.isnull().sum()

Output:

Imputation

import numpy as np
from sklearn.impute import SimpleImputer

npxtrain=np.array(pdxtrain)
npytrain=np.array(pdytrain)

# Categorical attributes: impute with the most frequent value (mode)
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp.fit(npxtrain[:,[1,2,4,5,6,7]])
pred_categ=imp.transform(npxtrain[:,[1,2,4,5,6,7]])

# Continuous attributes: impute with the mean
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(npxtrain[:,[0,3,8,9,10]])
pred_int=imp.transform(npxtrain[:,[0,3,8,9,10]])

# Stack the imputed categorical and continuous columns back together
npimputedxtrain=np.c_[pred_categ,pred_int]
pdimputedxtrain=pd.DataFrame(npimputedxtrain)

pdimputedxtrain.columns =['workclass','education','marital status','occupation','relationship','sex','Age','education-num','capital-gain','capital-loss','hours-per-week']

pdimputedxtrain.describe(include='all')

pdimputedxtrain.describe(include='all')

Output:

Now that we have the whole dataset, we convert the discrete data into binary 0/1 columns; this is called One Hot Encoding. For categorical data we first label encode it, that is, replace the categories with numbers, and then apply one hot encoding.
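As a toy illustration of the end result, pandas' get_dummies produces the same kind of indicator columns in one call (the example values are hypothetical):

# Three hypothetical workclass values -> one indicator column per category
toy = pd.Series(['Private', 'State-gov', 'Private'])
print(pd.get_dummies(toy))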

Label Encoding

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder() 
  
pdimputedxtrain['workclass']= le.fit_transform(pdimputedxtrain['workclass']) 
pdimputedxtrain['education']= le.fit_transform(pdimputedxtrain['education'])
pdimputedxtrain['marital status']= le.fit_transform(pdimputedxtrain['marital status'])
pdimputedxtrain['occupation']= le.fit_transform(pdimputedxtrain['occupation'])
pdimputedxtrain['relationship']= le.fit_transform(pdimputedxtrain['relationship'])
pdimputedxtrain['sex']= le.fit_transform(pdimputedxtrain['sex'])

# 'education' is already captured numerically by 'education-num', so drop the redundant column
pdimputedxtrain=pdimputedxtrain.drop(['education'],axis=1)

# The frame was built from an object array, so make every column numeric
pdimputedxtrain=pdimputedxtrain.apply(pd.to_numeric)
print(pdimputedxtrain.head())

Output:

One Hot Encoding

from sklearn.compose import ColumnTransformer

# OneHotEncoder's old categorical_features argument was removed from scikit-learn,
# so a ColumnTransformer selects the five categorical columns instead;
# remainder='passthrough' keeps the numeric columns, sparse_threshold=0 forces a dense array.
onehotencoder = ColumnTransformer([('onehot', OneHotEncoder(), [0,1,2,3,4])],
                                  remainder='passthrough', sparse_threshold=0)
npOneHotencoded = onehotencoder.fit_transform(pdimputedxtrain)
pdOneHotencoded=pd.DataFrame(npOneHotencoded)
pdOneHotencoded['salary']=npytrain   # append the response so the frame holds the full training set

pdOneHotencoded.columns =['Federal-gov','Local-gov','Private','Self-emp-not-inc','State-gov','Self-emp-inc','Without-pay','Married-AF-spouse','Married-civ-spouse','Married-spouse-absent','Divorced','Never-married','Separated','Widowed','cater','Adm-clerical','Armed-Forces','Exec-managerial','Farming-fishing','Handlers-cleaners','Machine-op-inspct','Other-service','Priv-house-serv','Prof-specialty','Protective-serv','Sales','Tech-support','Transport-moving','Husband','Not-in-family','Other-relative','Own-child','Unmarried','Wife','Female','Male','Age','education-num','capital-gain','capital-loss','hours-per-week','salary']

pdOneHotencoded.describe()

Output:

Based on the table above, a very small mean for a one-hot column indicates that the corresponding category occurs in only a tiny fraction of the rows, so we choose to omit those attributes. The same can be observed from the histograms below.
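Such rare categories can also be surfaced programmatically; a small sketch (the 0.01 cut-off is an arbitrary illustrative threshold, not from the article):

# One-hot columns whose mean is below 0.01 are set in fewer than 1% of rows
means = pdOneHotencoded.drop('salary', axis=1).mean()
print(means[means < 0.01])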

Histogram 

import matplotlib.pyplot as plt

pdimputedxtrain.hist(figsize=(8,8))
plt.show()

Output:

Delete the attributes

del pdOneHotencoded['Without-pay']
del pdOneHotencoded['Married-AF-spouse']
del pdOneHotencoded['Married-spouse-absent']
del pdOneHotencoded['Armed-Forces']
del pdOneHotencoded['Priv-house-serv']
del pdOneHotencoded['Wife']
del pdOneHotencoded['Other-relative']
del pdOneHotencoded['Widowed']
del pdOneHotencoded['Separated']
del pdOneHotencoded['Federal-gov']
del pdOneHotencoded['Married-civ-spouse']
del pdOneHotencoded['Local-gov']
del pdOneHotencoded['Adm-clerical']

Now we have a complete dataset which we can use to train a model. Though there are many models we could fit, let us go with logistic regression and learn how to analyse the results.

Fit Logistic Model

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

xtrain=pdOneHotencoded.drop(['salary'],axis=1)
ytrain=pdOneHotencoded['salary']

# max_iter is raised because the default solver may not converge on unscaled features
clf = LogisticRegression(random_state=0, max_iter=1000).fit(xtrain, ytrain)
pred_ytrain=clf.predict(xtrain)
accuracy_score(ytrain,pred_ytrain)

Output:

        0.7608
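Accuracy alone can be misleading when the two salary classes are imbalanced; as an optional sanity check, scikit-learn's classification_report summarizes per-class precision and recall:

from sklearn.metrics import classification_report

# Per-class precision, recall and F1 on the training predictions
print(classification_report(ytrain, pred_ytrain))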

 

Plot Confusion Matrix

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# confusion_matrix(y_true, y_pred): rows are actual classes, columns are predicted classes
cfm = confusion_matrix(ytrain, pred_ytrain)
sns.heatmap(cfm, annot=True)
plt.xlabel('Predicted classes')
plt.ylabel('Actual classes')
plt.show()

Output:

Confusion matrix heatmap

Plot ROC

from sklearn.metrics import roc_curve, auc

# Predicted probabilities for the positive class on the training set
pred_train_log_prob=clf.predict_proba(xtrain)

fpr,tpr,_= roc_curve(ytrain,pred_train_log_prob[:,1])
roc_auc=auc(fpr,tpr)
print('area under the curve',roc_auc)
print('Accuracy',accuracy_score(ytrain,pred_ytrain))
plt.plot(fpr,tpr,label='ROC curve(area=%0.2f)' %roc_auc)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()

Output:

As we can see, our model is not performing well; the accuracy is just 0.76. Now we need to debug this. The first and foremost thing to check is whether there is any collinearity between the attributes that is disturbing the model.

Collinearity Heat Map

corr=pdOneHotencoded[['Age','education-num','capital-gain','capital-loss','hours-per-week','salary']].corr(method='pearson')
print(corr)
plt.figure(figsize=(10, 10))
plt.imshow(corr, cmap='coolwarm', interpolation='none', aspect='auto')
plt.colorbar()
plt.xticks(range(len(corr)), corr.columns, rotation='vertical')
plt.yticks(range(len(corr)), corr.columns)
plt.suptitle('Numeric Data Correlations Heat Map', fontsize=15, fontweight='bold')
plt.show()

Output:

It seems there isn't any strong pairwise correlation. One more thing needs to be checked: the Variance Inflation Factor (VIF). The VIF of predictor i is 1/(1 − R_i²), where R_i² is the R² obtained by regressing that predictor on all the other predictors, so it captures multicollinearity that pairwise correlations can miss.

Calculating VIF

from statsmodels.stats.outliers_influence import variance_inflation_factor

cont=pdOneHotencoded[['Age','education-num','capital-loss','hours-per-week','capital-gain']]

# VIF for each continuous predictor, computed against all the others
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(cont.values, i) for i in range(cont.shape[1])]
vif["features"] = cont.columns

print(vif)

Output:

VIF should be as low as possible; a value above 10 is typically considered unacceptable.
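Following that rule of thumb, the offending columns can be picked out programmatically; a small sketch using the vif frame built above:

# Features whose VIF exceeds the usual threshold of 10
high_vif = vif.loc[vif["VIF Factor"] > 10, "features"].tolist()
print(high_vif)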

Deleting Attributes with High VIF

del pdOneHotencoded['Age']
del pdOneHotencoded['education-num']
del pdOneHotencoded['capital-loss']
del pdOneHotencoded['hours-per-week']
del pdOneHotencoded['capital-gain']

That's it: we have covered all the essential steps of a basic data analysis of multidimensional data. Applying these steps in the same sequence, most types of data can be analyzed and the necessary insights derived.

Link to dataset and full code here
