Predict survivors from Titanic tragedy using Machine Learning in Python

Machine learning has become one of the most important and widely used technologies of the last ten years. In essence, it means a machine learning from the data given to it. Two of its main types are supervised learning and unsupervised learning. In this tutorial, we will learn how to deal with a simple machine learning problem using a supervised learning technique, namely classification.

We already have data on the people who boarded the Titanic. Here we are going to input information about a particular person and predict whether that person survived. I have explored the Titanic passenger data set and found some interesting patterns. In this tutorial, we will use data analysis and data visualization techniques to find patterns in the data. Then we will use a machine learning algorithm to create a model for prediction.

In simple words, this article shows how to predict the survivors of the Titanic tragedy with machine learning in Python.

Importing Libraries

First, we import the pandas library, which is used to work with DataFrames. Then we import the numpy library, which is used for dealing with arrays. These are the main libraries used for data analysis.

Then we have two libraries, seaborn and Matplotlib, that are used for data visualisation, a method of making graphs to visually analyze patterns. In this tutorial, we use the random forest classification algorithm to analyze the data, so we import RandomForestClassifier from the scikit-learn library to design our model.

 

# importing main libraries
import numpy as np
import pandas as pd

# importing libraries for visualisation
import seaborn as sn
from matplotlib import pyplot as plt
from matplotlib import style as st

# Using RandomForestClassifier as algorithm
from sklearn.ensemble import RandomForestClassifier

Reading the data

Below is our Python program to read the data:

# Reading the training and test sets into dataframes using pandas
test_data = pd.read_csv("test.csv") 
train_data = pd.read_csv("train.csv")

Analyzing the features of the dataset

# gives the data type and the number of non-null values of each column
train_data.info()

The output of the program will look like this:

RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)

This tells us that we have twelve features and a total of 891 entries in the training data set. Two features are floats, while five features each have the data types int and object. From the above, we can see that Embarked has two missing values, which can be easily handled, while Age has 177 missing values, which will be handled later. Cabin has the most missing values, i.e. 687.

train_data.describe()

       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

Using the description above, we see that Age has missing values (its count is 714 rather than 891). Also, approximately 38% of people in the training set survived.
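The 38% figure comes from the mean of the Survived column: for a column of zeros and ones, the mean is the fraction of ones. A minimal sketch on made-up values (the toy series below is an assumption, not the real data):

```python
import pandas as pd

# Toy stand-in for the 'Survived' column (1 = survived, 0 = did not).
survived = pd.Series([0, 1, 1, 0, 0, 1, 0, 0], name="Survived")

# The mean of a 0/1 column equals the share of rows that are 1.
survival_rate = survived.mean()
print(f"{survival_rate:.1%}")
```

On the real training set the same call, train_data['Survived'].mean(), gives the 0.383838 shown in the table.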

After analysing the data, we will now start working on it. First, we need to deal with all missing (NaN) values. So we count the number of null values in each column and make a new data frame named missing to see the statistics of the missing values.

total = train_data.isnull().sum().sort_values(ascending= False)
percent_1 = (train_data.isnull().sum()/ train_data.isnull().count())*100  # percentage of missing values in each column
percent_2 = (round(percent_1,1).sort_values(ascending = False))
missing=  pd.concat([total,percent_2], axis = 1, keys = ['total','%'])
missing.head(5)
          total     %
Cabin       687  77.1
Age         177  19.9
Embarked      2   0.2
Fare          0   0.0
Ticket        0   0.0

We confirm from the above table that Cabin has 687 missing values. Embarked has two while age has 177.

Analysis of correlation using Data Visualisation

After finding the missing values our first step should be to find the correlation between different attributes and class label – ‘Survived’. This will give us information about which attributes are to be used in the final model.

# AGE AND SEX CORRELATION ON SURVIVAL
# Note: distplot is deprecated in newer seaborn releases; histplot is its replacement for histograms.

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))  # one row with two axes: women on the left, men on the right
female = train_data[train_data['Sex'] == 'female']
male = train_data[train_data['Sex'] == 'male']
# histograms (kde=False) of age for women who survived and women who did not
ax = sn.distplot(female[female['Survived'] == 1].Age.dropna(), bins=20, label='survived', ax=axes[0], kde=False)
ax = sn.distplot(female[female['Survived'] == 0].Age.dropna(), bins=40, label='not survived', ax=axes[0], kde=False)
ax.legend()  # adding the legend box
ax.set_title('FEMALE')
# histograms (kde=False) of age for men who survived and men who did not
ax = sn.distplot(male[male['Survived'] == 1].Age.dropna(), bins=20, label='survived', ax=axes[1], kde=False)
ax = sn.distplot(male[male['Survived'] == 0].Age.dropna(), bins=40, label='not survived', ax=axes[1], kde=False)
ax.legend()
ax.set_title("MALE")

After analyzing the output, we see that there are certain ages where the survival rate is greater. For women, the chances of survival are higher between 14 and 40, while men have a high probability of survival between 18 and 30. Between the ages of 5 and 18, men have a low probability of survival, while that isn’t true for women. So Age is an important attribute for predicting survival.

Now we will check the importance of the port of embarkation and Pclass for survival.

# We are using point plots to check: women in the default colour, men in red. This is for port C
em = sn.pointplot(x = 'Pclass',y = 'Survived', data =  female[female['Embarked']== 'C'],palette=None,  order=None, hue_order=None)
em = sn.pointplot(x = 'Pclass',y = 'Survived', data =  male[male['Embarked']== 'C'],palette=None,  order=None, hue_order=None, color = 'r')
em.set_title("Port C")

# this is for port S
em = sn.pointplot(x = 'Pclass',y = 'Survived', data =  female[female['Embarked']== 'S'],palette=None,  order=None, hue_order=None)
em = sn.pointplot(x = 'Pclass',y = 'Survived', data =  male[male['Embarked']== 'S'],palette=None,  order=None, hue_order=None, color = 'r')
em.set_title("Port S")

# This is for port Q
em = sn.pointplot(x = 'Pclass',y = 'Survived', data = female[female['Embarked']== 'Q'],palette=None,  order=None, hue_order=None)
em = sn.pointplot(x = 'Pclass',y = 'Survived', data = male[male['Embarked']== 'Q'],palette=None,  order=None, hue_order=None, color = 'r')
em.set_title("Port Q")

After making plots of ‘Pclass’ vs ‘Survived’ for every port, we understand that the survival of women is greater than that of men. Now we will look more closely to see whether the value of Pclass is as important.

sn.barplot(x='Pclass', y='Survived', data=train_data)

This gives us a bar plot which shows that the survival rate is highest for Pclass 1 and lowest for Pclass 3.
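By default, a seaborn bar plot draws the mean of the y column for each x category, so the same numbers can be read directly with a groupby. A small sketch on invented rows (the toy frame is an assumption used only for illustration):

```python
import pandas as pd

# Toy passengers: class and survival flag (values are made up).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Mean of the 0/1 flag per class = survival rate per class.
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
print(rate_by_class)
```

Running train_data.groupby('Pclass')['Survived'].mean() on the real data would give the bar heights of the plot above.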

Now we will take the attributes SibSp and Parch. They both show the number of relatives a person had on the ship, so we will combine them to form an attribute named ‘relatives’.

data = [train_data, test_data]
for row in data:
    row['relatives'] = row['SibSp'] + row['Parch']
    # note: despite its name, not_alone is 1 for passengers travelling alone and 0 otherwise
    row.loc[row['relatives'] > 0, 'not_alone'] = 0
    row.loc[row['relatives'] == 0, 'not_alone'] = 1
    row['not_alone'] = row['not_alone'].astype(int)

# this counts the number of people who were alone and the number who were not
train_data['not_alone'].value_counts()

Output:

1    537
0    354
Name: not_alone, dtype: int64

The above output shows that 537 people were alone and the remaining 354 were with relatives.

# catplot accepts the aspect argument; kind='point' draws a point plot
ax = sn.catplot(x='relatives', y='Survived', data=train_data, kind='point', aspect=2.0)

On further analysis using data visualization, we can see that people with 1 to 3 relatives have a higher survival rate. Surprisingly, people with 6 relatives also have a high rate of survival.

Data Processing

Now we will see one by one which attributes we will use for designing our model.

Let us first take passenger id. It is not important for survival as the value of passenger id is unique for every person.

train_data = train_data.drop(['PassengerId'], axis=1)
train_data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
Survived     891 non-null int64
Pclass       891 non-null int64
Name         891 non-null object
Sex          891 non-null object
Age          714 non-null float64
SibSp        891 non-null int64
Parch        891 non-null int64
Ticket       891 non-null object
Fare         891 non-null float64
Cabin        204 non-null object
Embarked     889 non-null object
relatives    891 non-null int64
not_alone    891 non-null int64
dtypes: float64(2), int64(6), object(5)


Now we have the cabin number. The cabin number itself is not that important, but some useful information can be extracted from it. Every cabin number has the form C218, so if we separate out the letter we get the deck, which may matter for survival.

import re
deck = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "U": 8}
letters = re.compile("([a-zA-Z]+)")  # matches the leading letter(s) of a cabin number
data = [train_data, test_data]
for row in data:
    row['Cabin'] = row['Cabin'].fillna('z')  # placeholder for a missing cabin
    row['Deck'] = row['Cabin'].map(lambda x: letters.search(x).group())  # pull out the deck letter
    row['Deck'] = row['Deck'].map(deck)      # convert the letter to its number
    row['Deck'] = row['Deck'].fillna(0)      # the placeholder 'z' and unmapped letters become 0
    row['Deck'] = row['Deck'].astype(int)
    
train_data = train_data.drop(['Cabin'], axis=1)
test_data = test_data.drop(['Cabin'], axis=1)
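The same deck extraction can also be written with pandas string methods instead of a regex inside a lambda. The sketch below is an alternative under stated assumptions: the cabin values are invented, and the mapping keeps only decks A to G, with anything else (missing or unknown) coded as 0.

```python
import pandas as pd

# Invented cabin values, including a missing one and an unmapped letter.
cabins = pd.Series(["C85", None, "E46", "B57 B59", "T"])
deck_codes = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7}

# Take the first letter of each cabin string; missing cabins stay NaN.
letters = cabins.str.extract(r"([A-Za-z])", expand=False)

# Map letters to numbers; NaN and unmapped letters fall back to 0.
deck = letters.map(deck_codes).fillna(0).astype(int)
print(deck.tolist())
```

This avoids the placeholder string for missing cabins, since the missing values simply pass through as NaN until the final fillna.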

Next, we have Embarked. As we know from the above analysis, Embarked has two missing values, so we will first fill those. As the number of values to fill is very small, we can fill them with the most common port of embarkation.

train_data['Embarked'].describe() 
Output:
count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object

Here 'top' shows us the most common value. So, we will fill the two missing values with 'S', the most common port of embarkation.

# the most common value is S, which is inserted
common_value = 'S'
data = [train_data,test_data]
for row in data:
    row['Embarked']= row['Embarked'].fillna(common_value)
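Instead of hard-coding 'S', the most common value can be looked up from the column itself. A minimal sketch, assuming a toy series in place of the real Embarked column:

```python
import pandas as pd

# Toy ports with two missing entries (values are made up).
ports = pd.Series(["S", "C", None, "S", "Q", None, "S"])

# mode() ignores missing values by default and returns the most frequent entries.
most_common = ports.mode()[0]
filled = ports.fillna(most_common)
print(most_common, filled.isnull().sum())
```

If two ports were tied for first place, mode() would return both in sorted order and [0] would pick the first, so this is only a reasonable default, not a rule.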

Next, we will handle the Age attribute, which had 177 missing values. For Age, we use the mean, the standard deviation, and the number of null values to fill the gaps with random values drawn from the range between mean - std and mean + std.

data = [train_data, test_data]
# use the training-set statistics for both data sets
mean = train_data['Age'].mean()
std = train_data['Age'].std()
for row in data:
    null = row['Age'].isnull().sum()
    # draw random integer ages within one standard deviation of the mean
    random_age = np.random.randint(int(mean - std), int(mean + std), size=null)
    age1 = row['Age'].copy()  # work on a copy of the Age column
    age1[np.isnan(age1)] = random_age
    row['Age'] = age1.astype(int)
    
train_data['Age'].isnull().sum()

This will give us an output of zero, which shows that all the missing values were randomly filled. After handling all the missing values, our next step is to bring all the attributes to the same data type.
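Because the fill values are random, the imputed ages change on every run unless the random generator is seeded. The sketch below shows one way to make the step repeatable; the seed, the toy ages, and the use of NumPy's Generator API are assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the fill is repeatable

# Toy ages with two gaps (values are made up).
ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 35.0])

mean, std = ages.mean(), ages.std()
low, high = int(mean - std), int(mean + std)  # integer bounds of the fill range
n_missing = int(ages.isnull().sum())

filled = ages.copy()
# integers() draws from [low, high), i.e. the upper bound is excluded
filled[filled.isnull()] = rng.integers(low, high, size=n_missing)
filled = filled.astype(int)
print(filled.tolist())
```

With the same seed, the same ages are filled in each time, which makes later scores easier to compare between runs.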

Normalizing data

We have one attribute named ‘Fare’ whose values are floats, while there are four attributes with the object data type: Name, Sex, Ticket and Embarked. First, we will convert Fare from float to int.

# First fare float to int.
data = [train_data, test_data]
for row in data:
    row ['Fare']= row['Fare'].fillna(0)
    row['Fare'] =  row['Fare'].astype(int)

The next attribute is ‘Ticket’. Now if we think logically the ticket number is not a factor on which survival depends so we can drop this attribute.

train_data = train_data.drop(['Ticket'], axis=1)
test_data = test_data.drop(['Ticket'], axis=1)

So we have dropped ‘ticket’ from the training and test dataset.

Now we will convert Embarked and Sex into ints by mapping their categories to integers. For example, if an attribute has two values, say male and female, we can encode one value as 0 and the other as 1.

# For Sex
from sklearn import preprocessing
number = preprocessing.LabelEncoder()
train_data['Sex'] = number.fit_transform(train_data['Sex'].astype(str))
# reuse the encoder fitted on the training set so both sets share one mapping
test_data['Sex'] = number.transform(test_data['Sex'].astype(str))
# For Embarked
number = preprocessing.LabelEncoder()
train_data['Embarked'] = number.fit_transform(train_data['Embarked'].astype(str))
test_data['Embarked'] = number.transform(test_data['Embarked'].astype(str))
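To see which integer each category received, the fitted encoder's classes_ attribute can be inspected; LabelEncoder numbers the categories in sorted order. A small sketch on toy values (the series below are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy columns standing in for train and test 'Sex' values.
train_sex = pd.Series(["male", "female", "female", "male"])
test_sex = pd.Series(["female", "male"])

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_sex)  # learn the mapping on the training values
test_codes = encoder.transform(test_sex)        # apply the same mapping to the test values

# classes_ lists the categories in the order of their integer codes.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
```

Fitting once and only transforming the second set guarantees that a category gets the same code in both sets, even if one set happens to lack a category.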

Now all values are ints except Name. But if we think it over, the only information that we can get from the name is the sex of the person, which we already have as an attribute. So we can drop this attribute.

# dropping name which is not important factor
train_data = train_data.drop(['Name'], axis=1)
test_data = test_data.drop(['Name'],axis =1)

Now our data is pre-processed and all attributes are numeric. The next step is to categorize the necessary attributes. For example, if we put the Age attribute into bins, the model can more easily relate age groups to survival.

# dividing age into categories and converting it to numerical form
data = [train_data, test_data]
for row in data:
    row['Age'] = row['Age'].astype(int)
    row.loc[ row['Age'] <= 11, 'Age'] = 0
    row.loc[(row['Age'] > 11) & (row['Age'] <= 18), 'Age'] = 1
    row.loc[(row['Age'] > 18) & (row['Age'] <= 22), 'Age'] = 2
    row.loc[(row['Age'] > 22) & (row['Age'] <= 27), 'Age'] = 3
    row.loc[(row['Age'] > 27) & (row['Age'] <= 33), 'Age'] = 4
    row.loc[(row['Age'] > 33) & (row['Age'] <= 40), 'Age'] = 5
    row.loc[(row['Age'] > 40) & (row['Age'] <= 66), 'Age'] = 6
    row.loc[row['Age'] > 66, 'Age'] = 6
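The chain of row.loc assignments can also be expressed as a single pd.cut call. This is a sketch of an equivalent binning under the same edges, with toy ages as an assumption; with labels=False, pd.cut returns the bin index, and each bin includes its right edge by default.

```python
import numpy as np
import pandas as pd

# Toy ages covering several bins (values are made up).
ages = pd.Series([4, 11, 12, 18, 25, 40, 41, 70])

# Bin edges matching the thresholds used above; everything over 40 lands in group 6.
edges = [-np.inf, 11, 18, 22, 27, 33, 40, np.inf]
age_group = pd.cut(ages, bins=edges, labels=False)
print(age_group.tolist())
```

Keeping the edges in one list makes it easier to change the bins later without editing several lines.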

Next, we create two new attributes named Age_Class and Fare_Per_Person.

# A new feature age_class
data = [train_data, test_data]
for dataset in data:
    dataset['Age_Class']= dataset['Age']* dataset['Pclass']

As the fare as a whole is not that important, we will create a new attribute Fare_Per_Person and drop Fare from the test and training sets.

# new feature: fare per person
for row in data:
    row['Fare_Per_Person'] = row['Fare']/(row['relatives']+1)
    row['Fare_Per_Person'] = row['Fare_Per_Person'].astype(int)

train_data = train_data.drop(['Fare'],axis = 1)
test_data = test_data.drop(['Fare'],axis = 1)

We have completed all the manipulations with data. The next step is to make a machine learning model.

Machine Learning Model

We will use the Random forest classifier for this problem.

# Building machine learning model
X_train = train_data.drop("Survived", axis=1)
Y_train = train_data["Survived"]
X_test  = test_data.drop("PassengerId", axis=1).copy()
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)

Y_prediction = random_forest.predict(X_test)

random_forest.score(X_train, Y_train)
random_forest_score = round(random_forest.score(X_train, Y_train) * 100, 2)
random_forest_score

Output:

94.39

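This gives us the accuracy of the model on the training data, i.e. 94.39%. Since the model has already seen these rows, this figure overstates how well it will do on new passengers.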
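The gap between training accuracy and accuracy on unseen rows can be illustrated with a simple hold-out split. The sketch below uses synthetic data from make_classification rather than the Titanic set, so the data, split size, and seeds are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, noisy classification data (not the Titanic data).
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)

# Hold back a quarter of the rows that the model never sees during fitting.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_fit, y_fit)

train_acc = model.score(X_fit, y_fit)  # accuracy on rows used for fitting
val_acc = model.score(X_val, y_val)    # accuracy on held-back rows
print(f"train: {train_acc:.2f}  validation: {val_acc:.2f}")
```

A fully grown random forest usually scores close to 100% on its own training rows, so the validation figure is the more honest estimate.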

K-Fold Cross-Validation:

K-fold cross-validation splits the data randomly into k subsets called folds. Say we have 4 folds; then our model will be trained and evaluated 4 times, each time evaluated on 1 fold and trained on the other three. The result is an array that contains 4 different scores, for which we then compute the mean and the standard deviation. The cross_val_score call in this tutorial uses cv=10, giving 10 scores.
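The four-fold procedure described above can be written out as an explicit loop. This is a sketch on synthetic data with a plain shuffled KFold; the data, the number of trees, and the seeds are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

# Synthetic data standing in for X_train / Y_train.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

kfold = KFold(n_splits=4, shuffle=True, random_state=0)
fold_scores = []
for fit_idx, eval_idx in kfold.split(X):
    # Train on three folds, evaluate on the remaining one.
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X[fit_idx], y[fit_idx])
    fold_scores.append(model.score(X[eval_idx], y[eval_idx]))

fold_scores = np.array(fold_scores)
print("Scores:", fold_scores)
print("Mean:", fold_scores.mean(), "Std:", fold_scores.std())
```

scikit-learn's cross_val_score performs this kind of loop in a single call; for classifiers with an integer cv it uses stratified folds, which keep the class proportions similar in every fold.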

#  K-Fold Cross Validation 
from sklearn.model_selection import cross_val_score
r = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(r, X_train, Y_train, cv=10, scoring = "accuracy")
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())

Output:

Scores: [0.77777778 0.8 0.75280899 0.80898876 0.85393258 0.82022472 0.80898876 0.79775281 0.84269663 0.88636364]
Mean: 0.814953467256838
Standard Deviation: 0.03640171045208266

This shows our model has a mean accuracy of about 81.5% and a standard deviation of about 4%. This means the accuracy of our model can vary by roughly ±4 percentage points from fold to fold. Now we will see the importance of the attributes used in the model formation.

# importance of different attributes
imp = pd.DataFrame({'feature':X_train.columns,'importance':np.round(random_forest.feature_importances_,3)})
imp = imp.sort_values('importance',ascending=False).set_index('feature')
imp.head(15)

Output:

                 importance
feature
Sex                   0.288
Fare_Per_Person       0.201
Age_Class             0.106
Deck                  0.077
Age                   0.075
Pclass                0.065
relatives             0.056
Embarked              0.053
SibSp                 0.037
Parch                 0.026
not_alone             0.016

 

We can see that not_alone and Parch have the least importance, so we drop these attributes.

# dropping the attributes that have the least importance

train_data  = train_data.drop("not_alone", axis=1)
test_data  = test_data.drop("not_alone", axis=1)

train_data = train_data.drop("Parch", axis=1)
test_data = test_data.drop("Parch", axis=1)

Once again we will find the score of the model. It should be about the same as before, i.e. 94.39, which shows that those attributes were not really important for this model.

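# rebuild the feature matrices so the dropped columns are actually excluded
X_train = train_data.drop("Survived", axis=1)
X_test = test_data.drop("PassengerId", axis=1).copy()

random_forest = RandomForestClassifier(n_estimators=100, oob_score=True)
random_forest.fit(X_train, Y_train)
Y_prediction = random_forest.predict(X_test)

# accuracy on the training data, as a percentage
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
print(acc_random_forest, "%")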

Output:

94.39

Now we will find the out-of-bag (OOB) score, which estimates the accuracy of the model from the training rows that each tree did not see during fitting.

# oob score
print("oob score:", round(random_forest.oob_score_, 4)*100, "%")

Output:

oob score: 81.58999999999999 %
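The out-of-bag idea rests on bootstrap sampling: each tree is fitted on rows drawn with replacement, so some rows are left out and can serve as unseen test rows for that tree. The sketch below only simulates one such draw to show how many rows end up out of bag; the seed is an assumption, and 891 is the training-set size from above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 891  # number of rows in the training set

# One bootstrap sample: draw n row indices with replacement.
bootstrap = rng.integers(0, n_samples, size=n_samples)

# Mark which rows were drawn at least once.
in_bag = np.zeros(n_samples, dtype=bool)
in_bag[bootstrap] = True

# Rows never drawn are "out of bag" for this tree.
oob_fraction = 1.0 - in_bag.mean()
print(f"out-of-bag fraction: {oob_fraction:.3f}")
```

For large samples this fraction approaches 1/e, roughly 37%, which is why every row is out of bag for a good share of the trees and can be scored without a separate validation set.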

# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV
para_grid = {
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [1, 5, 10, 25, 50, 70],
    "min_samples_split": [2, 4, 10, 12, 16, 18, 25, 35],
    "n_estimators": [100, 400, 700, 1000, 1500],
}
# 'sqrt' is what max_features='auto' meant for classifiers; recent scikit-learn no longer accepts 'auto'
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', oob_score=True, random_state=1, n_jobs=-1)
cf = GridSearchCV(estimator=rf, param_grid=para_grid, n_jobs=-1)
cf.fit(X_train, Y_train)
cf.best_params_

Output:

{'criterion': 'gini',
 'min_samples_leaf': 1,
 'min_samples_split': 16,
 'n_estimators': 100}
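A grid search fits one model per parameter combination and per cross-validation fold, so it helps to count the work before starting. The sketch below counts the combinations in the grid used above; the figure of five folds is an assumption based on the default in recent scikit-learn versions.

```python
from sklearn.model_selection import ParameterGrid

para_grid = {
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [1, 5, 10, 25, 50, 70],
    "min_samples_split": [2, 4, 10, 12, 16, 18, 25, 35],
    "n_estimators": [100, 400, 700, 1000, 1500],
}

# Every combination of the listed values is one candidate model.
n_candidates = len(ParameterGrid(para_grid))

# Assumption: 5-fold cross-validation, so each candidate is fitted 5 times.
n_fits = n_candidates * 5
print(n_candidates, "candidates,", n_fits, "fits")
```

With forests of up to 1500 trees this can take a long time, so a smaller grid is a sensible first run.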



Now we will find the oob score again after hyperparameter tuning.

# Testing our model using the gini index and finding the out-of-bag score.
random_forest = RandomForestClassifier(criterion = "gini", 
                                       min_samples_leaf = 1, 
                                       min_samples_split = 10,   # the grid search above reported 16; 10 is the value used for the score below
                                       n_estimators=100, 
                                       max_features='sqrt',   # same as the former 'auto' for classifiers
                                       oob_score=True, 
                                       random_state=1, 
                                       n_jobs=-1)

random_forest.fit(X_train, Y_train) 
Y_prediction = random_forest.predict(X_test)

random_forest.score(X_train, Y_train)

print("oob score:", round(random_forest.oob_score_, 4)*100, "%")

Output:

oob score: 81.93 %

This shows that our model has a training accuracy of 94.39% and an oob score of 81.93%.

 

DIFFERENT SCORES

CONFUSION MATRIX

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
predictions = cross_val_predict(random_forest, X_train, Y_train, cv=3)
confusion_matrix(Y_train, predictions)

Output:

array([[480, 69], [ 95, 247]])

The confusion matrix shows how the predictions compare with the true labels. People who survived but were predicted dead are called false negatives, while people who died but were predicted to have survived are called false positives. Here 69 and 95 are the numbers of false positives and false negatives respectively.
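The four cells of the matrix are enough to derive the usual summary metrics. The sketch below starts from the matrix printed above and computes precision, recall, and accuracy; the variable names are my own.

```python
import numpy as np

# Confusion matrix from above: rows are true labels, columns are predictions.
cm = np.array([[480, 69],
               [95, 247]])

# ravel() flattens row by row: true neg, false pos, false neg, true pos.
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)     # of those predicted to survive, how many did
recall = tp / (tp + fn)        # of those who survived, how many were found
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
# prints: precision=0.782 recall=0.722 accuracy=0.816
```

The accuracy of about 0.816 agrees with the cross-validation and oob scores, since these predictions also come from cross-validation.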

ROC-AUC SCORE

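from sklearn.metrics import roc_auc_score
# predicted probability of the positive class (survived) for each training row
y_scores = random_forest.predict_proba(X_train)[:, 1]
r_a_score = roc_auc_score(Y_train, y_scores)
print("ROC-AUC-Score:", r_a_score)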

Output:

ROC-AUC-Score: 0.9465109342877535

This output shows a score of about 0.95, which is a very good score, although it is computed against the training labels and is therefore optimistic. The score is the area under the ROC curve, abbreviated AUC. A classifier that is 100% correct would have a ROC AUC score of 1, and a completely random classifier would have a score of 0.5. Our classifier has a score of 0.95, so it separates the two classes well on this data.
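The two reference points, 1 for a perfect ranking and 0.5 for an uninformative one, can be checked on a tiny hand-made example. The labels and scores below are invented for illustration.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]

# Every positive is scored above every negative: perfect ranking.
perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])

# The same score for everyone carries no information about the class.
uninformative = roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5])

# Every positive is scored below every negative: ranking is exactly reversed.
reversed_ranking = roc_auc_score(y_true, [0.9, 0.8, 0.2, 0.1])

print(perfect, uninformative, reversed_ranking)
```

The AUC depends only on how the scores rank the rows, not on the score values themselves, which is why it is computed from probabilities rather than from hard 0/1 predictions.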

Now we have our model, so we can easily make further predictions. Our model is ready to predict survivors of the Titanic tragedy.

 
