Predict survivors from Titanic tragedy using Machine Learning in Python
Machine Learning has become one of the most important and widely used technologies of the last ten years. At its core, Machine Learning means a machine learning from the data given to it, and it comes in two main types: Supervised Learning and Unsupervised Learning. In this tutorial, we will work through a simple machine learning problem using Supervised Learning, specifically Classification.
We already have data about the people who boarded the Titanic. The goal is to take a passenger's information as input and predict whether that person survived. I have explored the Titanic passenger dataset and found some interesting patterns. In this tutorial, we will use data analysis and data visualization techniques to find patterns in the data, and then use machine learning algorithms to build a prediction model.
In short, this article predicts the survivors of the Titanic tragedy with Machine Learning in Python. Now continue through this post…
Importing Libraries
First, we import the pandas library, which is used to work with DataFrames. Then we import the numpy library, which is used to work with arrays. These are the core libraries used throughout the data analysis.
Next, we have two libraries, seaborn and Matplotlib, which are used for data visualization, i.e. making graphs to visually analyze patterns. In this tutorial, we use the Random Forest classification algorithm to analyze the data, so we import RandomForestClassifier from the scikit-learn library to build our model.
# importing main libraries
import numpy as np
import pandas as pd

# importing libraries for visualisation
import seaborn as sn
from matplotlib import pyplot as plt
from matplotlib import style as st

# Using RandomForestClassifier as the algorithm
from sklearn.ensemble import RandomForestClassifier
Reading the data
Below is our Python program to read the data:
# Reading the training and test sets into dataframes using pandas
test_data = pd.read_csv("test.csv")
train_data = pd.read_csv("train.csv")
Analyzing the features of the dataset
# gives the data type and the number of non-null values of each column
train_data.info()
The output of the program will look like this:
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
This tells us that the training set has a total of 891 entries and twelve features. Two features are of type float, while five are int and five are object. From the counts above we can also see that Embarked has two missing values, which are easy to handle, Age has 177 missing values, which we will handle later, and Cabin has the most missing values, 687 in total.
train_data.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
Using the description above we understand that age has missing values. Also, approximately 38% of people in the training set survived.
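The 38% figure can be verified directly from the Survived column. A minimal check, using the train_data frame loaded above:

# quick sanity check of the overall survival rate in the training set
survival_rate = train_data['Survived'].mean()
print(f"Overall survival rate: {survival_rate:.1%}")   # roughly 38.4%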
After analyzing the data we have, we can start working on it. First, we deal with the missing and NaN values. We count the number of null values in each column and build a new data frame named missing to summarize the missing-value statistics.
total = train_data.isnull().sum().sort_values(ascending=False)
# percentage of null values per column
percent_1 = (train_data.isnull().sum() / train_data.isnull().count()) * 100
percent_2 = round(percent_1, 1).sort_values(ascending=False)
missing = pd.concat([total, percent_2], axis=1, keys=['total', '%'])
missing.head(5)
| | total | % |
|---|---|---|
| Cabin | 687 | 77.1 |
| Age | 177 | 19.9 |
| Embarked | 2 | 0.2 |
| Fare | 0 | 0.0 |
| Ticket | 0 | 0.0 |
We confirm from the table above that Cabin has 687 missing values, Embarked has two, and Age has 177.
Analysis of correlation using Data Visualisation
After finding the missing values, our first step should be to find the correlation between the different attributes and the class label 'Survived'. This will tell us which attributes should be used in the final model.
# AGE AND SEX CORRELATION WITH SURVIVAL
# create two subplot axes side by side, one for females and one for males
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
female = train_data[train_data['Sex'] == 'female']
male = train_data[train_data['Sex'] == 'male']
# age histograms for women who survived and women who did not
ax = sn.distplot(female[female['Survived'] == 1].Age.dropna(), bins=20, label='survived', ax=axes[0], kde=False)
ax = sn.distplot(female[female['Survived'] == 0].Age.dropna(), bins=40, label='not survived', ax=axes[0], kde=False)
ax.legend()  # adding the legend box
ax.set_title('FEMALE')
# age histograms for men who survived and men who did not
ax = sn.distplot(male[male['Survived'] == 1].Age.dropna(), bins=20, label='survived', ax=axes[1], kde=False)
ax = sn.distplot(male[male['Survived'] == 0].Age.dropna(), bins=40, label='not survived', ax=axes[1], kde=False)
ax.legend()
ax.set_title("MALE")
Analyzing the output, we see that there are certain age ranges where the survival rate is higher. For women, the chance of survival is higher between 14 and 40, while men have a higher probability of survival between 18 and 30. Between the ages of 5 and 18, men have a low probability of survival, while that isn't true for women. So Age is an important attribute for predicting survival.
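To put rough numbers behind these visual observations, we can bucket Age and compare survival rates by sex. This is an optional sketch; the bin edges below are illustrative choices, not part of the original analysis:

# optional: approximate survival rate by age group and sex (bin edges are illustrative)
age_bins = [0, 5, 18, 30, 40, 80]
grouped = train_data.dropna(subset=['Age']).copy()
grouped['AgeGroup'] = pd.cut(grouped['Age'], bins=age_bins)
print(grouped.groupby(['Sex', 'AgeGroup'])['Survived'].mean())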
Now we will check the importance of the port of embarkation and Pclass for survival.
# Using point plots of Pclass vs Survived for each port of embarkation.
# This is for port C
em = sn.pointplot(x='Pclass', y='Survived', data=female[female['Embarked'] == 'C'], palette=None, order=None, hue_order=None)
em = sn.pointplot(x='Pclass', y='Survived', data=male[male['Embarked'] == 'C'], palette=None, order=None, hue_order=None, color='r')
em.set_title("Port C")
# This is for port S
em = sn.pointplot(x='Pclass', y='Survived', data=female[female['Embarked'] == 'S'], palette=None, order=None, hue_order=None)
em = sn.pointplot(x='Pclass', y='Survived', data=male[male['Embarked'] == 'S'], palette=None, order=None, hue_order=None, color='r')
em.set_title("Port S")
# This is for port Q
em = sn.pointplot(x='Pclass', y='Survived', data=female[female['Embarked'] == 'Q'], palette=None, order=None, hue_order=None)
em = sn.pointplot(x='Pclass', y='Survived', data=male[male['Embarked'] == 'Q'], palette=None, order=None, hue_order=None, color='r')
em.set_title("Port Q")
After making these plots, i.e. 'Pclass' vs 'Survived' for every port, we can see that the survival rate of women is greater than that of men; a quick numerical check of this is sketched below. Next, we will dig a little deeper to see whether the value of Pclass is just as important.
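A minimal check of the survival rate by sex, using the train_data frame we already have (Sex is still a string column at this point):

# survival rate by sex: women survived at a much higher rate than men
print(train_data.groupby('Sex')['Survived'].mean())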
sn.barplot(x='Pclass', y='Survived', data=train_data)
This gives us a barplot showing that the survival rate is highest for Pclass 1 and lowest for Pclass 3.
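We can also check the exact per-class rates behind the barplot; this is a quick verification using the same train_data:

# exact survival rate per passenger class
# Pclass 1 has the highest rate and Pclass 3 the lowest
print(train_data.groupby('Pclass')['Survived'].mean())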
Now we take the attributes SibSp and Parch. Both essentially give the number of relatives a person had on the ship, so we will combine them into a single attribute named 'relatives'.
data = [train_data, test_data]
for row in data:
    row['relatives'] = row['SibSp'] + row['Parch']
    row.loc[row['relatives'] > 0, 'not_alone'] = 0
    row.loc[row['relatives'] == 0, 'not_alone'] = 1
    row['not_alone'] = row['not_alone'].astype(int)
# counts the number of people who were alone and the number who were not
train_data['not_alone'].value_counts()
Output:

1    537
0    354
Name: not_alone, dtype: int64

The output above shows that 537 people were alone and the remaining 354 were with relatives.
# point plot of survival rate against the number of relatives
g = sn.catplot(x='relatives', y='Survived', data=train_data, kind='point', aspect=2.0)
On further analysis with this visualization, we can see that people with 1 to 3 relatives have a higher survival rate. Surprisingly, people with 6 relatives also have a high rate of survival.
Data Processing
Now we will see one by one which attributes we will use for designing our model.
Let us first take passenger id. It is not important for survival as the value of passenger id is unique for every person.
train_data = train_data.drop(['PassengerId'], axis=1)
train_data.info()
Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
Survived     891 non-null int64
Pclass       891 non-null int64
Name         891 non-null object
Sex          891 non-null object
Age          714 non-null float64
SibSp        891 non-null int64
Parch        891 non-null int64
Ticket       891 non-null object
Fare         891 non-null float64
Cabin        204 non-null object
Embarked     889 non-null object
relatives    891 non-null int64
not_alone    891 non-null int64
dtypes: float64(2), int64(6), object(5)

Next we have the Cabin number. The Cabin number itself is not that important, but some useful information can be extracted from it. Every Cabin number is of the form C218, so if we separate out the letter we get the deck, which could be relevant for survival.
import re
deck = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "U": 8}
data = [train_data, test_data]
for row in data:
    row['Cabin'] = row['Cabin'].fillna('z')
    # extract the leading letter of the cabin number, i.e. the deck
    row['Deck'] = row['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
    row['Deck'] = row['Deck'].map(deck)
    row['Deck'] = row['Deck'].fillna(0)
    row['Deck'] = row['Deck'].astype(int)
# the raw Cabin column is no longer needed
train_data = train_data.drop(['Cabin'], axis=1)
test_data = test_data.drop(['Cabin'], axis=1)
Next, we have Embarked. As we know from the analysis above, Embarked has two missing values, so we fill those first. Since the number of values to fill is very small, we can fill them with the most common port of embarkation.
train_data['Embarked'].describe()
OUTPUT:

count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object

Here 'top' shows us the most common value, so we will fill the two missing values with port 'S'.
# the most common value is S, which is inserted for the missing entries
common_value = 'S'
data = [train_data, test_data]
for row in data:
    row['Embarked'] = row['Embarked'].fillna(common_value)
Next, we handle the Age attribute, which had 177 missing values. For Age, we use the mean, the standard deviation, and the number of null values to fill the gaps with random values drawn from the range around the mean.
data = [train_data, test_data]
for row in data:
    mean = train_data['Age'].mean()
    std = test_data['Age'].std()
    null = row['Age'].isnull().sum()
    # draw random integer ages in the range [mean - std, mean + std)
    random_age = np.random.randint(mean - std, mean + std, size=null)
    # copy the Age column and replace its NaN entries with the random ages
    age1 = row['Age'].copy()
    age1[np.isnan(age1)] = random_age
    row['Age'] = age1
    row['Age'] = row['Age'].astype(int)
train_data['Age'].isnull().sum()
This gives an output of zero, which shows that all the missing values were filled. After handling all the missing values, our next step is to bring all the attributes to the same data type.
Normalizing data
We have one attribute named 'Fare' whose values are floats, and there are four attributes with the object data type: 'Name', 'Sex', 'Ticket' and 'Embarked'. First, we convert Fare from float to int.
# First, Fare: float to int.
data = [train_data, test_data]
for row in data:
    row['Fare'] = row['Fare'].fillna(0)
    row['Fare'] = row['Fare'].astype(int)
The next attribute is ‘Ticket’. Now if we think logically the ticket number is not a factor on which survival depends so we can drop this attribute.
train_data = train_data.drop(['Ticket'], axis=1)
test_data = test_data.drop(['Ticket'], axis=1)
So we have dropped ‘ticket’ from the training and test dataset.
Now we will convert Embarked and Sex to int by turning their categories into integers. For example, if an attribute has two values, say male and female, we can make one value 0 and the other 1 and then convert the whole column to int.
# For Sex
from sklearn import preprocessing
number = preprocessing.LabelEncoder()
train_data['Sex'] = number.fit_transform(train_data['Sex'].astype(str))
test_data['Sex'] = number.fit_transform(test_data['Sex'].astype(str))
# For Embarked
from sklearn import preprocessing
number = preprocessing.LabelEncoder()
train_data['Embarked'] = number.fit_transform(train_data['Embarked'].astype(str))
test_data['Embarked'] = number.fit_transform(test_data['Embarked'].astype(str))
Now all values are int except Name. But if we think about it, the only information we can get from the name is the sex of the person, which we already have as an attribute, so we can drop this attribute.
# dropping Name, which is not an important factor
train_data = train_data.drop(['Name'], axis=1)
test_data = test_data.drop(['Name'], axis=1)
Now our data is pre-processed and normalized. The next step is to categorize the relevant attributes. For example, if we put Age into bins, the model can work with age groups instead of raw ages.
# dividing Age into categories and converting to numerical form
data = [train_data, test_data]
for row in data:
    row['Age'] = row['Age'].astype(int)
    row.loc[ row['Age'] <= 11, 'Age'] = 0
    row.loc[(row['Age'] > 11) & (row['Age'] <= 18), 'Age'] = 1
    row.loc[(row['Age'] > 18) & (row['Age'] <= 22), 'Age'] = 2
    row.loc[(row['Age'] > 22) & (row['Age'] <= 27), 'Age'] = 3
    row.loc[(row['Age'] > 27) & (row['Age'] <= 33), 'Age'] = 4
    row.loc[(row['Age'] > 33) & (row['Age'] <= 40), 'Age'] = 5
    row.loc[(row['Age'] > 40) & (row['Age'] <= 66), 'Age'] = 6
    row.loc[ row['Age'] > 66, 'Age'] = 6
Next, we are creating two new attributes named age_class and fare_per_person.
# A new feature: Age_Class
data = [train_data, test_data]
for dataset in data:
    dataset['Age_Class'] = dataset['Age'] * dataset['Pclass']
Since the raw fare on its own is not that informative, we create a new attribute Fare_Per_Person and drop Fare from the training and test sets.
# new feature: fare per person
for row in data:
    row['Fare_Per_Person'] = row['Fare'] / (row['relatives'] + 1)
    row['Fare_Per_Person'] = row['Fare_Per_Person'].astype(int)
train_data = train_data.drop(['Fare'], axis=1)
test_data = test_data.drop(['Fare'], axis=1)
We have completed all the manipulations with data. The next step is to make a machine learning model.
Machine Learning Model
We will use the Random forest classifier for this problem.
# Building the machine learning model
X_train = train_data.drop("Survived", axis=1)
Y_train = train_data["Survived"]
X_test = test_data.drop("PassengerId", axis=1).copy()

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_prediction = random_forest.predict(X_test)

random_forest.score(X_train, Y_train)
random_forest_score = round(random_forest.score(X_train, Y_train) * 100, 2)
random_forest_score
Output:
94.39
This is the accuracy of the model on the training set, i.e. 94.39%.
K-Fold Cross-Validation:
This splits the data randomly into k subsets called folds. Let’s say we have 4 folds, then our model will be trained and evaluated 4 times. Every time it is evaluated on 1 fold and trained on the other three folds. The result of this K-Fold Cross Validation would be an array that contains 4 different scores. We then compute the mean and the standard deviation for these scores. Below is the code for K-fold Cross-Validation.
# K-Fold Cross Validation
from sklearn.model_selection import cross_val_score
r = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(r, X_train, Y_train, cv=10, scoring="accuracy")
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())
Output:
Scores: [0.77777778 0.8        0.75280899 0.80898876 0.85393258 0.82022472
 0.80898876 0.79775281 0.84269663 0.88636364]
Mean: 0.814953467256838
Standard Deviation: 0.03640171045208266
This shows our model has a mean accuracy of about 81% with a standard deviation of about 4%, which means the accuracy can vary by roughly ±4%. Now we will look at the importance of the attributes used to build the model.
# importance of the different attributes
imp = pd.DataFrame({'feature': X_train.columns, 'importance': np.round(random_forest.feature_importances_, 3)})
imp = imp.sort_values('importance', ascending=False).set_index('feature')
imp.head(15)
Output:
| feature | importance |
|---|---|
| Sex | 0.288 |
| Fare_Per_Person | 0.201 |
| Age_Class | 0.106 |
| Deck | 0.077 |
| Age | 0.075 |
| Pclass | 0.065 |
| relatives | 0.056 |
| Embarked | 0.053 |
| SibSp | 0.037 |
| Parch | 0.026 |
| not_alone | 0.016 |
We can see that not_alone and Parch have the least importance, so we drop these attributes.
# dropping the attributes that have the least importance
train_data = train_data.drop("not_alone", axis=1)
test_data = test_data.drop("not_alone", axis=1)
train_data = train_data.drop("Parch", axis=1)
test_data = test_data.drop("Parch", axis=1)
Once again we find the score of the model. It should be roughly the same as before, i.e. 94.39, which shows that those attributes weren't actually important for this model.
# rebuild the feature matrices so the dropped columns are actually excluded
X_train = train_data.drop("Survived", axis=1)
Y_train = train_data["Survived"]
X_test = test_data.drop("PassengerId", axis=1).copy()

random_forest = RandomForestClassifier(n_estimators=100, oob_score=True)
random_forest.fit(X_train, Y_train)
Y_prediction = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
print(round(acc_random_forest, 2), "%")
Output:
94.39
Now we will look at the out-of-bag (OOB) score, which estimates the model's accuracy using the samples each tree did not see during training.
# out-of-bag score of the fitted random forest
print("oob score:", round(random_forest.oob_score_, 4) * 100, "%")
Output:
oob score: 81.58999999999999 %
# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV, cross_val_score

para_grid = {"criterion": ["gini", "entropy"],
             "min_samples_leaf": [1, 5, 10, 25, 50, 70],
             "min_samples_split": [2, 4, 10, 12, 16, 18, 25, 35],
             "n_estimators": [100, 400, 700, 1000, 1500]}

rf = RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True, random_state=1, n_jobs=-1)
cf = GridSearchCV(estimator=rf, param_grid=para_grid, n_jobs=-1)
cf.fit(X_train, Y_train)
cf.best_params_
Output:
{'criterion': 'gini',
 'min_samples_leaf': 1,
 'min_samples_split': 16,
 'n_estimators': 100}

Now we will find the OOB score again after hyperparameter tuning.
# Testing the model with the gini criterion and checking the out-of-bag score.
random_forest = RandomForestClassifier(criterion="gini",
                                       min_samples_leaf=1,
                                       min_samples_split=10,
                                       n_estimators=100,
                                       max_features='auto',
                                       oob_score=True,
                                       random_state=1,
                                       n_jobs=-1)
random_forest.fit(X_train, Y_train)
Y_prediction = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
print("oob score:", round(random_forest.oob_score_, 4) * 100, "%")
Output:
oob score: 81.93 %
This shows that our model has a training accuracy of 94.39% and an OOB score of 81.93%.
DIFFERENT SCORES
CONFUSION MATRIX
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
predictions = cross_val_predict(random_forest, X_train, Y_train, cv=3)
confusion_matrix(Y_train, predictions)
Output:
array([[480,  69],
       [ 95, 247]])
The confusion matrix shows the number of people who survived but were predicted dead, called false negatives, as well as the people who died but were predicted to have survived, called false positives. Here 69 is the number of false positives and 95 the number of false negatives.
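To make this mapping explicit, the four cells can be unpacked by name. This is a small illustrative addition, not part of the original tutorial, using the same Y_train and predictions:

# unpack the confusion matrix: rows are true labels, columns are predictions
tn, fp, fn, tp = confusion_matrix(Y_train, predictions).ravel()
print("True negatives :", tn)   # died, predicted died
print("False positives:", fp)   # died, predicted survived
print("False negatives:", fn)   # survived, predicted died
print("True positives :", tp)   # survived, predicted survived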
ROC-AUC SCORE
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score
# y_scores: cross-validated predicted probabilities of survival, needed for the ROC-AUC score
y_scores = cross_val_predict(random_forest, X_train, Y_train, cv=3, method="predict_proba")[:, 1]
r_a_score = roc_auc_score(Y_train, y_scores)
print("ROC-AUC-Score:", r_a_score)
Output:
ROC-AUC-Score: 0.9465109342877535
This output shows a score of about 95%, which is very good. The score is simply the area under the ROC curve, called the AUC. A classifier that is 100% correct would have a ROC AUC score of 1, and a completely random classifier would have a score of 0.5. Our classifier has a ROC AUC of about 0.95, so it is a good classifier.
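If you want to see the curve whose area gives this score, the short optional sketch below (not in the original article) plots it with scikit-learn's roc_curve, reusing the y_scores computed above:

# optional: plot the ROC curve whose area gives the score above
from sklearn.metrics import roc_curve
false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_train, y_scores)
plt.plot(false_positive_rate, true_positive_rate, label='Random Forest')
plt.plot([0, 1], [0, 1], linestyle='--', label='random guessing')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()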
Now that we have our model, we can easily make further predictions. Our model is ready to predict survivors of the Titanic tragedy; a final example is sketched below.
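As a closing illustration, the fitted model can produce predictions for the processed test set. This is a hedged sketch: it assumes test_data still contains its PassengerId column and went through the same preprocessing as train_data, and the output file name 'submission.csv' is just an example:

# Example: predict survival for the preprocessed test passengers
predictions_test = random_forest.predict(X_test)
submission = pd.DataFrame({'PassengerId': test_data['PassengerId'],
                           'Survived': predictions_test})
submission.to_csv('submission.csv', index=False)   # example output file name
print(submission.head())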