Predictive Models for Ebola using Machine Learning in Python

In this article, you will learn Predictive Models for Ebola using Machine Learning in Python. The Ebola virus spread according to the region and respective climate of countries. On March 23, 2014, the World Health Organization (WHO) announced that an Ebola virus revolution that began in the Republic of Guinea in December 2013 had spread to several West African countries. Preliminary data from the 2014 EVD epidemic suggest that in all cases, approximately 20% of children were less than 16 years of age, in which case mortality was 90% among children less than  1-year-old.

For the dataset, visit:kaggle/ebola-outbreak

For the CSV file, visit: Ebola dataset CSV file

Image of EBOLA  virus: Ebola Virus Image

Import the libraries,

import requests 
import numpy as np
import pandas as pd

from sklearn import datasets, linear_model, preprocessing

Get the data,

names = ['Indicator','Country','Date','value']
df = pd.read_csv('https://ds-ec2.scraperwiki.com/g7nnqgn/ckm9nsfssakeuor/cgi-bin/csv/ebola_data_db_format.csv',names=names,na_values='?',header=None,skiprows=1)
df = df.dropna()
df.head(5)
df.head()

Predictive Models for Ebola using Machine Learning in Python

df.shape()

 

 

(17585, 4)
df['Indicator'].value_counts()
df['Country'].unique()

Output:-

array(['Guinea', 'Liberia', 'Sierra Leone', 'United Kingdom', 'Mali',
       'Nigeria', 'Senegal', 'Spain', 'United States of America', 'Italy',
       'Liberia 2', 'Guinea 2'], dtype=object)

T_Guinea=[73.0,84.0,78.5]
T_Liberia=[81.0,81.0,81.0]
T_Sierra_Leone=[78.6,83.3,80.95]
T_United_Kingdom=[40.0,60.0,50.0]
T_Mali=[77.2,90.3,83.75]
T_Nigeria=[78.8,82.4,80.6]
T_Senegal=[70.0,82.0,76.0]
T_Spain=[46.0,73.0,59.5]
T_United_States_of_America=[33.2,71.9,52.55]
T_Italy=[46.0,78.0,62.0]
T = [T_Guinea, T_Italy, T_Liberia, T_Mali, T_Nigeria, T_Senegal,
     T_Sierra_Leone, T_Spain, T_United_Kingdom, T_United_States_of_America]
print(T[0][0])

O/P:-

73.0

Split the data frame, choose first 0~9947 rows as clean data

ntotal = 9947
ntr = 8000
nts = ntotal - ntr
df_tot = df.loc[0:ntotal,:]

Preprocessing

x_df = df_tot.loc[df['Indicator'] == 'Cumulative number of confirmed, probable and suspected Ebola cases', ['Country', 'Date', 'value']]

 

x_df_country = x_df.take([0], axis=1)
x_df_date = x_df.take([1], axis=1)

x_country = x_df_country.values
x_date = x_df_date.values
x_date = x_date.ravel()
x_days = pd.to_datetime(x_date, format= "%Y-%m-%d").dayofyear.values
x_country = x_country.ravel()

Now extract the confirmed Ebola cases:-

ls = []
for i in range(x_date.shape[0]):
    ls.append(df_tot.loc[(df['Date'] == x_date[i]) & 
                        (df['Country'] == x_country[i]) & 
                        (df['Indicator'] == 'Cumulative number of confirmed Ebola cases'),
                        ['value']].take([0], axis=1).values)

Convert cd list to NumPy array of float elements:-

cd_arr = np.array(ls)
for i in range(len(cd_arr)):
    if cd_arr[i].ravel().size == 0:
        cd_arr[i] = float(0.0)
    try:
        cd_arr[i] = cd_arr[i].ravel()[0]
    except:
        continue

Get the total Ebola cases:-

totd_arr = x_df.take([2], axis=1).values
totd_arr = totd_arr.ravel()

Compute the percentage of confirmed ebola ratio but first, fix invalid fields

invalid_fields = np.where(totd_arr == 0)[0] 
totd_arr[invalid_fields] = (1, 1) 
cd_arr[invalid_fields] = (0, 0) 
cd_ratio = cd_arr / totd_arr

y = cd_ratio

Create three arrays ready to be replaced by the lowest temperature,

highest temperature and average temperature.

x_lo_T=np.array([i for i in x_country])
x_hi_T=np.array([i for i in x_country])
x_avg_T=np.array([i for i in x_country])

nCountry = len(np.unique(x_country))
for c in range (nCountry):
    np.place(x_lo_T, x_lo_T == np.unique(x_country)[c], T[c][0])
for c in range (nCountry):
    np.place(x_hi_T, x_hi_T == np.unique(x_country)[c], T[c][1])
for c in range (nCountry):
    np.place(x_avg_T, x_avg_T == np.unique(x_country)[c], T[c][2])
X=np.row_stack([xtr_lo_T, xtr_hi_T, xtr_avg_T, x_days]).T
X=X.astype(float)
# Normalization
X=X/np.max(X,axis=1)[:,None]
X.shape

Linear Model

Linear models describe a continuous response variable as a function of one or more predictor variables. They can help you understand and predict the behavior of difficult systems or analyze experimental, financial, and natural data.

We import the linear model and use the first 80% of data for training and the rest for analysis.

from sklearn import linear_model
ntr = int(np.floor(0.8*X.shape[0]))
xtr = X[:ntr,:]
ytr = y[:ntr]
xts = X[ntr:, : ]
yts = y[ntr:]
#linear regression
regr = linear_model.LinearRegression()
#Trainnig model
regr.fit(xtr,ytr)

#Prediction
y_pred = regr.predict(xts)

Therefore, Assume countries and days are correlated,

Xtr = np.column_stack((np.ones(ntr),xtr[:,0],xtr[:,1],xtr[:,2],xtr[:,3]))
Xts = np.column_stack((np.ones(xts.shape[0]),xts[:,0],xts[:,1],xts[:,2],xts[:,3])) 
w1 = ytr[None,:,None]/Xtr[None,:,:] 
w1 = np.mean(W1,axis=1) 
y_pred1 = w1*Xts
import matplotlib
import matplotlib.pyplot as plt
plt.plot(y_pred,yts,'.')
plt.plot([0.25,1.5],[0.25,1.5],'r-',linewidth=1)

Output Graph:-

predicting Ebola virus

Conclusion:

Further ,”The countries with higher temperature (room temperature or above) is likely to increase the speed at which the virus dies in the atmosphere.”

Refrences: https://www.who.int

Leave a Reply

Your email address will not be published. Required fields are marked *