Feature Engineering, Data Exploration, and Classification using Titanic Dataset

In machine learing, what perhaps is more important than defining a model or a neural network is data exploration. After we get the dataset, that is hopefully representative of the problem we are trying to solve, the first thing we do is to examine the data and see what characteristics and features are present. Are there any problems in the data? Are some categories under-represented? Are there missing values? Are there significant correlations between some features and outcome? Can we easily engineer new features from existing ones that will correlate better with the outcome? Data exploration not only helps to discover and eliminates some potential problems before model training begins, but will also makes our model more accurate by injecting some human intuition into the solution.

In this tutorial, we will use the Titanic passenger survival dataset to illustrate the concept of data exploration and feature engineering. We choose this dataset because it is simple, and thus allows us to focus on engineering the features. We will show how engineered features play an important role in prediction by being the most important feature in machine learning models. Then, we will illustrate compare several methods of binary classification that can used to predict whether a passenger survives the disaster.

The Titanic dataset can be download from this website. Prediction Titanic passenger survival is also a Kaggle challege. Note that because the passenger survival information is public, the Kaggle challenge leaderboard is spammed with people submitting actual real-world data as machine learning results, making it meaningless.

1. Data Exploration

We first load the "titanic.csv" file into a Pandas dataframe. We will immediately print the first few lines of the file.

In [8]:
import pandas as pd

data = pd.read_csv('titanic.csv')
data.head(10)
Out[8]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest
0 1 1 Allen, Miss. Elisabeth Walton female 29.00 0 0 24160 211.3375 B5 S 2 NaN St Louis, MO
1 1 1 Allison, Master. Hudson Trevor male 0.92 1 2 113781 151.5500 C22 C26 S 11 NaN Montreal, PQ / Chesterville, ON
2 1 0 Allison, Miss. Helen Loraine female 2.00 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.00 1 2 113781 151.5500 C22 C26 S NaN 135.0 Montreal, PQ / Chesterville, ON
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.00 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON
5 1 1 Anderson, Mr. Harry male 48.00 0 0 19952 26.5500 E12 S 3 NaN New York, NY
6 1 1 Andrews, Miss. Kornelia Theodosia female 63.00 1 0 13502 77.9583 D7 S 10 NaN Hudson, NY
7 1 0 Andrews, Mr. Thomas Jr male 39.00 0 0 112050 0.0000 A36 S NaN NaN Belfast, NI
8 1 1 Appleton, Mrs. Edward Dale (Charlotte Lamson) female 53.00 2 0 11769 51.4792 C101 S D NaN Bayside, Queens, NY
9 1 0 Artagaveytia, Mr. Ramon male 71.00 0 0 PC 17609 49.5042 NaN C NaN 22.0 Montevideo, Uruguay

We immediately notice that the second column "survived" is our prediction result. The "name" column contains not only the first and last name, but also the title, which could be an important indicator of socio-economic status, which could be a factor in the survival probability. The "sibsp" columns contains the number of siblings and spouses a passenger have on the ship. The "parch" column is the number of parent or guardian the passenger have on the ship. The two columns added together is the size of the family together on the ship. The "ticket" column is the ticket number, which is unique for vast majority of the passengers. Some tickets have alphabets in front. The "embarked" column is the port of departure. This could be another indicator of socio-economic status along with the "fare" column. The "boat" and "body" column is information that should not be used to predict survival rate, since it is statistics gathered after the event. These two columns should be dropped. The "cabin" column is very interesting, it contains the deck information and the room number. Which deck the passenger resides in could be an indicator of survival rate, and should be extracted. Room number on the other hand, whithout knowing the actual layout of the ship, is not very useful. Finally, the "home.dest" column shows the home and destination of the passenger, which could be another indicator of socio-economic status.

We also immdediatly see that some data is missing. We need to examine in detail which data is missing and determine the best way to fill in these missing data without skewing the prediction results.

We can use the folloing command to list the categories in the dataset.

In [9]:
print(data.columns.values)
['pclass' 'survived' 'name' 'sex' 'age' 'sibsp' 'parch' 'ticket' 'fare'
 'cabin' 'embarked' 'boat' 'body' 'home.dest']

From these categories, will will drop "boat", "body" from the dataset since they cannot be used as predictors and they are also not the outcome. We will also drop the "home.dest" column since we already have many other good indicators of socio-economic status. Keeping the "home.dest" column will not cause harm, we just want to keep our model a simpler.

In [10]:
drop_cat=['boat','body','home.dest']
data.drop(drop_cat, inplace=True, axis=1)
data.head(10)
Out[10]:
pclass survived name sex age sibsp parch ticket fare cabin embarked
0 1 1 Allen, Miss. Elisabeth Walton female 29.00 0 0 24160 211.3375 B5 S
1 1 1 Allison, Master. Hudson Trevor male 0.92 1 2 113781 151.5500 C22 C26 S
2 1 0 Allison, Miss. Helen Loraine female 2.00 1 2 113781 151.5500 C22 C26 S
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.00 1 2 113781 151.5500 C22 C26 S
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.00 1 2 113781 151.5500 C22 C26 S
5 1 1 Anderson, Mr. Harry male 48.00 0 0 19952 26.5500 E12 S
6 1 1 Andrews, Miss. Kornelia Theodosia female 63.00 1 0 13502 77.9583 D7 S
7 1 0 Andrews, Mr. Thomas Jr male 39.00 0 0 112050 0.0000 A36 S
8 1 1 Appleton, Mrs. Edward Dale (Charlotte Lamson) female 53.00 2 0 11769 51.4792 C101 S
9 1 0 Artagaveytia, Mr. Ramon male 71.00 0 0 PC 17609 49.5042 NaN C

Let use directly goto the point and see how many passengers survived the disaster.

In [12]:
data['survived'].mean()
Out[12]:
0.3819709702062643

About 40% of the passengers survived. We know from historical records that women and children were given priority on the lifeboats. Was this true, and did it reflect in their survival rates? Lets take a look.

In [70]:
print('Survival Rate by Sex')
print(data['survived'].groupby(data['sex']).mean())
print('\n\nSex Ratio of Passengers')
print(data['sex'].value_counts(normalize=True))
Survival Rate by Sex
sex
female    0.727468
male      0.190985
Name: survived, dtype: float64


Sex Ratio of Passengers
male      0.644003
female    0.355997
Name: sex, dtype: float64

We notice that 72% of the female passengers survived, despite making up only 35% of the total passengers. So indeed women had a larger survival rate than men. But what about age? Lets plot the passenger age historgram.

In [23]:
hist = data['age'].hist(bins=30)

Looks like most of the pasengers are in their twenties and thirties. There are lots of children too. How about their survival rate?

In [27]:
data['survived'].groupby(pd.cut(data['age'], 20)).mean().plot(kind='bar')
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d7ca7947f0>

It does look like children have a higher survival rate. The few elderly on the ship also survived.

How about soci-economic status. Do rich people have a survival rate higher than the poor? Lets look at fare first.

In [64]:
data['survived'].groupby(pd.cut(data['fare'], [0,5,10,20,40,70,100,1000])).mean().plot(kind='bar')
Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d7cd259860>

I definitly see a trend here. If you paid more than \$70 for your ticket, your chances of survival are above 60%. If you paied less than \$10, your chances are only at 20%. Lets look at passenger class.

In [71]:
data['survived'].groupby(data['pclass']).mean().plot(kind='bar')
Out[71]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d7cd44e5c0>

Clearly 1st class passengers have an advantage of 3rd class.

Now that we have explored our dataset we have fairly good idea on what features are good indicators of survival rate. We maybe able to increase the accuracy of our prediction by engineer new features. For example, maybe doctors have higher survival rate, and military personel have lower survival rate. We can engineer additional features from the features that are already given to us. These additional features represent human intuition, and is very difficult for a machine model to discern.

2. Feature Engineering

Once we have a good understanding of the dataset we are working with, we can engineer some new features from it to make our predictions more accurate. We first extract the first, last names and the title from the "name" column.

In [11]:
data['first name'] = data['name'].str.split(',|\\.',expand = True)[2] #expand set to True to return a df instead of series
data['first name'] = data['first name'].str.strip() #strip leading and trailing white spaces
data['last name'] = data['name'].str.split(',|\\.',expand = True)[0] #expand set to True to return a df instead of series
data['last name'] = data['last name'].str.strip()
data['title'] = data['name'].str.split(',|\\.',expand = True)[1] #expand set to True to return a df instead of series
data['title'] = data['title'].str.strip()

data['title'].value_counts() #just display the name column summary
Out[11]:
Mr              757
Miss            260
Mrs             197
Master           61
Dr                8
Rev               8
Col               4
Ms                2
Mlle              2
Major             2
Jonkheer          1
Sir               1
the Countess      1
Capt              1
Mme               1
Don               1
Lady              1
Dona              1
Name: title, dtype: int64

We display the titles gathered from the "name" column. Note that aside from the common titles, there are some religions titles ("Rev"), some noble titles ("Jonkheer", "Don", etc), and some military titles ("Capt", etc). We will sort these titles into social status. Our intuition is that the social status of a person have an impact on their survival rate, and can increase our prediction accuracy when used.

In [72]:
status_map={'Capt':'Military',
            'Col':'Military',
            'Don':'Noble',
            'Dona':'Noble',
            'Dr':'Dr',
            'Jonkheer':'Noble',
            'Lady':'Noble',
            'Major':'Military',
            'Master':'Common',
            'Miss':'Common',
            'Mlle':'Common',
            'Mme':'Common',
            'Mr':'Common',
            'Mrs':'Common',
            'Ms':'Common',
            'Rev':'Clergy',
            'Sir':'Noble',
            'the Countess':'Noble',
            }

data['social status'] = data['title'].map(status_map)

What about the size of the family? Would a larger family have more chances of survival?

In [73]:
data['family members'] = data['parch'] + data['sibsp']

What about the deck the passenger resides in. Would a higher deck offer more chances of escape?

In [74]:
data['deck'] = data['cabin'].str.replace('[0-9]','').str.split(' ', expand=True)[0]
#1. delete cabin number, leaving leading letter (deck); 2. since multiple cabins are assigned, just get the first one, they are all on the same deck

What about the lenght of the name, or length of the ticket number? Could they hold some mysteries meaning that we can't discern but a machine model can extract?

In [75]:
data['name length'] = data['name'].apply(lambda x: len(x))
data['ticket length'] = data['ticket'].apply(lambda x: len(x))

Let's see if any of the newly engineered features are good indicators of survival. Lets first look at deck.

In [77]:
data['survived'].groupby(data['social status']).mean().plot(kind='bar')
Out[77]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d7cd0fe5c0>

Well, it looks like if you are nobility, you escaped with high probability. On the other hand, if you are clergy, you may have stayed behind. What about family size?

In [79]:
data['survived'].groupby(data['family members']).mean().plot(kind='bar')
Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d7cd56aeb8>

So it looks like smaller families survived with higher rate, parents probably left with their children. Single passengers probably stayed behind more. Interestingly, larger family have lower survival rate. Perhaps they could not all fit into a single lifeboat and chose to stay behind together.

The newly engineered features are indeed good indicators to the survival rate. We will use them as part of our model to increase model accuracy. These features are easily obtained using human intuition, but is hard for a model to generate by itself.

3. Data Augmentation

3.1 Fill in Missing Values

Before we feed our features into any machine model, we need to fill in the missing values and make sure the data format is correct. First, let's see which columns has missing values.

In [80]:
data.isnull().any()
Out[80]:
pclass            False
survived          False
name              False
sex               False
age                True
sibsp             False
parch             False
ticket            False
fare               True
cabin              True
embarked           True
first name        False
last name         False
title             False
social status     False
family members    False
deck               True
name length       False
ticket length     False
dtype: bool

It looks like "age", "fare", "cabin", "embarked", and "deck" have missing values. We must take care filling these values. For categorical data, we can just leave it blank, since "not available" is also a category and may indeed convey some information. Filling a missing categorical data could generate some artificial information which could be harmful to the model. For non-categorical data, we may use the mean value, but we would like to note that this entry is artificiall generated. So we create another column indicating this.

  1. for "age" column, we use mean value to fill, and use another column to indicate this value is artificial.
In [81]:
data['age available'] = ~data['age'].isnull()
data['age'] = data['age'].fillna(data['age'].mean())
  1. For "fare" data we also use mean value to fill, and use another column to indicate this value is artificial.
In [82]:
data['fare available'] = ~data['fare'].isnull()
data['fare'] = data['fare'].fillna(data['fare'].mean())
  1. For "deck" column, since this is categorical, we use NA to indicate missing values
In [83]:
data['deck'] = data['deck'].fillna('NA')
  1. For "embarked" column, we also use NA to indicate missing values
In [84]:
data['embarked'] = data['embarked'].fillna('NA')
  1. Finally we drop the "cabin" column since we already gleened "deck" information from it, and we are not using the room numbers

Check once again which column have missing data. We will drop the "cabin" column in the next section

In [85]:
data.isnull().any()
Out[85]:
pclass            False
survived          False
name              False
sex               False
age               False
sibsp             False
parch             False
ticket            False
fare              False
cabin              True
embarked          False
first name        False
last name         False
title             False
social status     False
family members    False
deck              False
name length       False
ticket length     False
age available     False
fare available    False
dtype: bool

3.2 Data Transformation

Once all missing data is filled in. We will extract the columns that we will use in our model.

In [88]:
import numpy as np
cat_used=['survived','pclass','sex','age','sibsp','parch','fare','embarked','title','social status','family members','deck','name length','ticket length','age available', 'fare available']
data_used=pd.DataFrame()
for cat in cat_used:
    data_used[cat_used] = data[cat_used]

data_used.columns.values
Out[88]:
array(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'title', 'social status', 'family members', 'deck',
       'name length', 'ticket length', 'age available', 'fare available'],
      dtype=object)

We have dropped the "cabin" column along with any other column not used.

Categorical data needs to be transformed into one-hot vector representation for the model. Assigning an integer to each category does not work because we cannot do number operations on categorical data. Rather, we need an indicator for these data, which is what one-hot vectors are best suited for.

In [89]:
categorical_data = ['sex','pclass','embarked','title','social status','deck','age available', 'fare available']

for cat in categorical_data:
    data_used = pd.concat((data_used, pd.get_dummies(data_used[cat], prefix = cat)), axis = 1)
    data_used.drop(cat, inplace=True, axis=1)

We split our data into test data and training data.

In [90]:
test_data_split = 0.3
msk = np.random.rand(len(data_used)) < test_data_split 

test = data_used[msk]
train = data_used[~msk]

We further split our data into output labels and input data.

In [91]:
Y_train = train['survived']
X_train = train.drop(['survived'], axis=1)

Y_test = test['survived']
X_test = test.drop(['survived'], axis=1)

For neural network and regression models, since we are doing mathematical operations on the input data, we need to normalize the input data to similar ranges, so that no single category data will dominate the calculated result. We use the MinMaxScaler() function in sklearn to scale our inputs to (0,1).

In [ ]:
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_test_minmax = min_max_scaler.fit_transform(X_test)

4. Model Definition and Training

We are finally ready to define our machine learning model and do some prediction. We compare the prediction accuracy of several classes of binary classification model.

4.1 Baseline Model

For our baseline, we will always predict that the passenger will not survive. We will compare the accuracy of this prediciton with the predictions of our machine learning models.

In [104]:
Y_pred_base=np.zeros((len(Y_test)))
compare=Y_pred_base==Y_test
acc_base = sum(compare)/len(compare)
print('Base Line Test Accuracy = ',acc_base)
Base Line Test Accuracy =  0.6515580736543909

Our baseline prediction is pretty good, with an accuracy of 65%.

4.2 Decision Tree

We briefly introduce decision trees.

  • Nonlinear classification model
  • Decision trees generate an approximate solution via greedy, top-down, recursive partitioning.

    • Greedy: grow the tree by recursively splitting the samples in the leaf $R_i$ according to $X_j > s$, such that $(R_i , X_j , s)$ maximize the drop in entropy.
    • The method is top-down because we start with the original input space X and split it into two child regions by thresholding on a single feature. We then take one of these child regions and can partition via a new threshold.
    • We continue the training of our model in a recursive manner, always selecting a leaf node, a feature, and a threshold to form a new split.
  • Stopping criteria we could use to determine when to halt the growth of a tree:

    • The simplest criteria involves ”fully” growning the tree: we continue until each leaf region contains exactly one training data point.
    • This technique however leads to a high variance and low bias model, and we therefore turn to various stopping heuristics for regularization. Some common ones include:
      • Minimum Leaf Size – Do not split R if its cardinality falls below a fixed threshold.
      • Maximum Depth – Do not split R if more than a fixed threshold of splits were already taken to reach R.
      • Maximum Number of Nodes – Stop if a tree has more than a fixed threshold of leaf nodes

The benefit of decision trees is that it can be easily visualized. At each node, we know which feature is being used to decide the splits. The disadvantage is that its a greedy process, and will not perform as well as some of the other models.

4.2.1 Hyper-parameter Tuning

There are several hyper-parameters we can tune to improve performance. Lets see them

In [167]:
dt_classifier.get_params().keys()
Out[167]:
dict_keys(['class_weight', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'min_impurity_split', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'presort', 'random_state', 'splitter'])

Lets scan through the "min_samples_split" and "min_samples_leaf" to maximize accuracy.

In [192]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

dt_classifier = DecisionTreeClassifier()
param_grid = { "min_samples_split" : [ 2,3,4,5,6,7], "min_samples_leaf": [1,2,5,10,20,30,40,50,60,70,80,90,100]}

grid_search = GridSearchCV(estimator=dt_classifier, param_grid=param_grid, scoring='accuracy', cv=3, n_jobs=-1)
grid_search = grid_search.fit(X_train, Y_train)
print(grid_search.best_score_)
print(grid_search.best_params_)
0.6234309623430963
{'min_samples_leaf': 80, 'min_samples_split': 2}

We see that our best score lies within the middle of our scan range. If the best score lies at either end, we must increase our range and scan again. Also, we can scan another round in finer grainarities around our best parameter .

In [195]:
param_grid = { "min_samples_split" : [2], "min_samples_leaf": range(70,90)}

grid_search = GridSearchCV(estimator=dt_classifier, param_grid=param_grid, scoring='accuracy', cv=3, n_jobs=-1)
grid_search = grid_search.fit(X_train, Y_train)
print(grid_search.best_score_)
print(grid_search.best_params_)
0.6234309623430963
{'min_samples_leaf': 73, 'min_samples_split': 2}

Lets use the best parameters in our model ("min_samples_split"=2 and "min_samples_leaf"=73) and fit using training data. After fitting we will use test data to gauge the accuracy of the model

In [196]:
dt_classifier = DecisionTreeClassifier(min_samples_split=2, min_samples_leaf=73)
dt_classifier.fit(X_train, Y_train)
dt_classifier.score(X_train,Y_train)
Y_pred_dt = dt_classifier.predict(X_test)
compare=Y_pred_dt==Y_test
acc_dt = sum(compare)/len(compare)
print('Decision Tree Test Accuracy = ',acc_dt)
Decision Tree Test Accuracy =  0.8101983002832861

The accuracy of the model in our is 81%.

4.2.2 Visualization

Lets generate the decision tree used by our model and visualize how the features are used to generate a prediction

In [197]:
from sklearn.externals.six import StringIO    
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(dt_classifier, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,
                feature_names = X_test.columns.values)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png("decision.png")
Out[197]:
True

Decision Tree

The model uses the "title", which is an engineered feature, at the top of the tree. The title of the passenger not only includes information on sex, but also soci-economic status of the passenger. We are very happy that a engineered feature is the most useful in the model. The second level uses "deck" and "passenger class". Deck is another engineered feature. So deck information of a passenger indeed is a useful feature to determine rate survival. However, this is used in unexpected ways, that is, if deck information is available or not.

We will generate the "importance" of each feature below in detail below.

In [203]:
pd.concat((pd.DataFrame(X_train.columns, columns = ['variable']), 
           pd.DataFrame(dt_classifier.feature_importances_, columns = ['importance'])), 
          axis = 1).sort_values(by='importance', ascending = False)[:10]
Out[203]:
variable importance
28 title_Mr 0.662732
11 pclass_3 0.198602
46 deck_NA 0.079498
15 embarked_S 0.029184
3 fare 0.018941
5 name length 0.011042
35 social status_Common 0.000000
37 social status_Military 0.000000
36 social status_Dr 0.000000
0 age 0.000000

As one can see, "title", "passenger class", "assignment of deck", "embarked port", and "fare" are the most important features to indicate survival. With the engineered feature of "title" being overwhelmingly the most important.

We can see from decision tree that indeed ones sex and wealth played an important role in one's survival rate. Age is not a good indicator.

The benefit of decision trees is that it allows us visualize and explain our finds. On other machine learning models, such as neural networks, and random forest, obtaining an intuition on how the features are used by the model is very difficult. Although they may be more accurate.

4.3 Random Forest

In random forest, we use many decision trees, each using only part of the data samples.

  • We fit a decision tree to different samples.
    • When growing the tree, we select a random sample of m < M predictors to consider in each step.
    • This will lead to very different (or “uncorrelated”) trees from each sample.
    • Finally, average the prediction of each tree.
  • Each tree is grown as follows:
    1. If the number of cases in the training set is N, sample N cases at random - but with replacement, from the original data. This sample will be the training set for growing the tree.
    2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
    3. Each tree is grown to the largest extent possible. There is no pruning.

4.3.1 Hyper-Parameter Tuning

We first scan through some hyper-parameters.

In [98]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier(criterion= 'gini', min_samples_leaf=1, max_features='auto', oob_score=True, random_state=1, n_jobs=-1)

param_grid = { "min_samples_split" : [ 12, 14, 16], "n_estimators": [40,50, 60, 80]}

grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, scoring='accuracy', cv=3, n_jobs=-1)
grid_search = grid_search.fit(X_train, Y_train)

print(grid_search.best_score_)
print(grid_search.best_params_)
0.5366108786610879
{'min_samples_split': 14, 'n_estimators': 60}

Once the grid search for hyperparameter is done, we choose the best one and train our model

In [99]:
rf_classifier = RandomForestClassifier(n_estimators=60, max_features='auto', criterion='gini', min_samples_split=14, min_samples_leaf = 1, oob_score=True, random_state=1, n_jobs=-1)
rf_classifier.fit(X_train, Y_train)
rf_classifier.oob_score_

pd.concat((pd.DataFrame(X_train.columns, columns = ['variable']), 
           pd.DataFrame(rf_classifier.feature_importances_, columns = ['importance'])), 
          axis = 1).sort_values(by='importance', ascending = False)[:10]

Y_pred_rf = rf_classifier.predict(X_test)
compare=Y_pred_rf==Y_test
acc_rf = sum(compare)/len(compare)
print('Random Forest Test Accuracy = ',acc_rf)
Random Forest Test Accuracy =  0.8271954674220963

4.3.2 Visualization

Lets look at the decision imprtance of the features used.

In [204]:
pd.concat((pd.DataFrame(X_train.columns, columns = ['variable']), 
           pd.DataFrame(rf_classifier.feature_importances_, columns = ['importance'])), 
          axis = 1).sort_values(by='importance', ascending = False)[:10]
Out[204]:
variable importance
28 title_Mr 0.123850
3 fare 0.106428
8 sex_male 0.100436
7 sex_female 0.093519
5 name length 0.077640
0 age 0.066252
25 title_Miss 0.052551
11 pclass_3 0.045043
4 family members 0.044770
6 ticket length 0.043136

Like the decision tree model, the title is the most important feature considered. Fare of the passengers follows. Again, an engineered feature is the most important here. This shows the importance of feature engineering.

4.4 Logistic Regression

Logic regression combines features together in a linear fashion to predict the outcome. Each feature is weighted differently and summed together. The optimum weights are trained using gradient descent.

In linear regression, the inputs should be normalized to the same range. This is so that no single feature will dominate the weighted combinations.

4.4.1 Hyper-Parameter Tuning

We will not tune any hyper-parameters for linear regression

In [101]:
from sklearn.linear_model import LogisticRegression

lr_classifierlogreg = LogisticRegression()
lr_classifierlogreg.fit(X_train_minmax, Y_train)

lr_classifierlogreg.score(X_train_minmax, Y_train)
Y_pred_lr = lr_classifierlogreg.predict(X_test_minmax)
compare=Y_pred_lr==Y_test
acc_lr = sum(compare)/len(compare)
print('Linear Regression Test Accuracy = ',acc_lr)
Linear Regression Test Accuracy =  0.8101983002832861
C:\Users\1\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

Our prediction accuracy is 81%

4.4.2 Visualization

Only the weights are available for us to examine. But in general gaining intuition from them is difficult. Thus we do not visualize the weights here

4.5 Neural Network

Neural network allows the non-linear combination of features to predict an outcome. This is achieved using non=linear activations such as "relu". Furthermore, we can have multiple layers of neural networks and re-combine the already combined features again. This makes neural networks more powerful than logistic regression. However, for simple datasets like the Titanic dataset, neural network will not signifcantly out perform other models.

4.5.1 Hyper-Parameter Tuning

Many parameters can be tuned in neural networks. This include the number of layers, size of each layer, activation of each layer, among others. For sake of simplicity, we will not tune hyper-parameters here. This is a topic that is too complicated to be covered here.

In [150]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import plot_model

model = Sequential()

# Add new layers
model.add(Dense(units=20, activation='linear',input_dim=52))
model.add(Dense(units=10, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))
print(model.summary())
plot_model(model, to_file='model.png',show_shapes=True)
model.compile(optimizer='adam',loss='binary_crossentropy', metrics=['accuracy'])
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_46 (Dense)             (None, 20)                1060      
_________________________________________________________________
dense_47 (Dense)             (None, 10)                210       
_________________________________________________________________
dense_48 (Dense)             (None, 1)                 11        
=================================================================
Total params: 1,281
Trainable params: 1,281
Non-trainable params: 0
_________________________________________________________________
None

4.4.2 Visualization

We can visualize our model here.

In [151]:
history = model.fit(x=X_train_minmax,
                    y=Y_train, 
                    batch_size=128, 
                    epochs=50, 
                    verbose=0, 
                    validation_split=0.2, 
                    shuffle=True)

Training a neural network model involves many iterations or epochs. The accuracy and error (loss) increase and decrease over time respectively.

In [152]:
import matplotlib.pyplot as plt
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='lower right')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()
dict_keys(['val_loss', 'val_acc', 'loss', 'acc'])
In [153]:
Y_pred_nn = model.predict(X_test_minmax)
compare=np.squeeze(np.round(Y_pred_nn))==np.array(Y_test)
acc_nn = sum(compare)/len(compare)
print('Neural Network Test Accuracy = ',acc_nn)
Neural Network Test Accuracy =  0.8215297450424929

Our model accuracy is 82%.

5. Model Comparison

We compare the test accuracy of our models below. For simple datasets, all models perform equally well. In particular, simple models such as decision tree is able to perform well when we feed it engineered features.

Model Test Accuracy
Baseline 0.6512
Decision Tree 0.8102
Random Forest 0.8272
Logistic Regression 0.8102
Neural Network 0.8215
In [205]:
%%HTML
<style>
div.prompt {display:none} #div.prompt {display:""} div.prompt {display:none} 
</style>