In machine learning, what is perhaps more important than defining a model or a neural network is data exploration. After we get a dataset that is hopefully representative of the problem we are trying to solve, the first thing we do is examine the data and see what characteristics and features are present. Are there any problems in the data? Are some categories under-represented? Are there missing values? Are there significant correlations between some features and the outcome? Can we easily engineer new features from existing ones that correlate better with the outcome? Data exploration not only helps to discover and eliminate potential problems before model training begins, but also makes our model more accurate by injecting some human intuition into the solution.
In this tutorial, we will use the Titanic passenger survival dataset to illustrate data exploration and feature engineering. We choose this dataset because it is simple, which allows us to focus on engineering the features. We will show that engineered features play an important role in prediction by ending up as the most important features in our machine learning models. Then, we will compare several binary classification methods that can be used to predict whether a passenger survives the disaster.
The Titanic dataset can be downloaded from this website. Predicting Titanic passenger survival is also a Kaggle challenge. Note that because the passenger survival information is public, the Kaggle challenge leaderboard is spammed with people submitting actual real-world data as machine learning results, making it meaningless.
We first load the "titanic.csv" file into a Pandas dataframe and immediately print its first few rows.
import pandas as pd
data = pd.read_csv('titanic.csv')
data.head(10)
We immediately notice that the second column, "survived", is our prediction target. The "name" column contains not only the first and last name, but also the title, which could be an indicator of socio-economic status, which in turn could be a factor in survival probability. The "sibsp" column contains the number of siblings and spouses a passenger has on the ship. The "parch" column is the number of parents or children the passenger has on the ship. The two columns added together give the size of the family traveling together. The "ticket" column is the ticket number, which is unique for the vast majority of passengers; some tickets have letter prefixes. The "embarked" column is the port of departure, which could be another indicator of socio-economic status along with the "fare" column. The "boat" and "body" columns contain information that should not be used to predict survival, since they are statistics gathered after the event; these two columns should be dropped. The "cabin" column is very interesting: it contains the deck and the room number. The deck a passenger resides on could be an indicator of survival and should be extracted. The room number, on the other hand, is not very useful without knowing the actual layout of the ship. Finally, the "home.dest" column shows the home and destination of the passenger, which could be yet another indicator of socio-economic status.
We also immediately see that some data is missing. We need to examine in detail which data is missing and determine the best way to fill in these missing values without skewing the prediction results.
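A quick per-column summary is a convenient way to do this. The snippet below is a minimal sketch (not part of the original analysis) that counts and ranks missing entries in the dataframe we just loaded.
import pandas as pd
# Hedged sketch: count and rank missing entries per column of the dataframe loaded above.
missing = pd.concat([data.isnull().sum().rename('missing count'),
                     data.isnull().mean().rename('missing fraction')], axis=1)
print(missing.sort_values('missing fraction', ascending=False))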
We can use the following command to list the columns in the dataset.
print(data.columns.values)
From these columns, we will drop "boat" and "body" from the dataset since they cannot be used as predictors and are not the outcome. We will also drop the "home.dest" column since we already have several other good indicators of socio-economic status. Keeping the "home.dest" column would not cause harm; we just want to keep our model simpler.
drop_cat=['boat','body','home.dest']
data.drop(drop_cat, inplace=True, axis=1)
data.head(10)
Let us go directly to the point and see how many passengers survived the disaster.
data['survived'].mean()
About 40% of the passengers survived. We know from historical records that women and children were given priority on the lifeboats. Was this true, and is it reflected in their survival rates? Let's take a look.
print('Survival Rate by Sex')
print(data['survived'].groupby(data['sex']).mean())
print('\n\nSex Ratio of Passengers')
print(data['sex'].value_counts(normalize=True))
We notice that 72% of the female passengers survived, despite making up only 35% of the total passengers. So indeed women had a higher survival rate than men. But what about age? Let's plot the passenger age histogram.
hist = data['age'].hist(bins=30)
It looks like most of the passengers are in their twenties and thirties. There are lots of children too. How about their survival rate?
data['survived'].groupby(pd.cut(data['age'], 20)).mean().plot(kind='bar')
It does look like children have a higher survival rate. The few elderly on the ship also survived.
How about socio-economic status? Did the rich have a higher survival rate than the poor? Let's look at fare first.
data['survived'].groupby(pd.cut(data['fare'], [0,5,10,20,40,70,100,1000])).mean().plot(kind='bar')
I definitely see a trend here. If you paid more than \$70 for your ticket, your chances of survival are above 60%. If you paid less than \$10, your chances are only about 20%. Let's look at passenger class.
data['survived'].groupby(data['pclass']).mean().plot(kind='bar')
Clearly 1st class passengers have an advantage over 3rd class passengers.
Now that we have explored our dataset, we have a fairly good idea of which features are good indicators of survival. We may be able to increase the accuracy of our predictions by engineering new features. For example, maybe doctors have a higher survival rate and military personnel have a lower one. We can engineer additional features from the features already given to us. These additional features encode human intuition and are very difficult for a machine learning model to discover by itself.
Once we have a good understanding of the dataset we are working with, we can engineer some new features from it to make our predictions more accurate. We first extract the first name, last name, and title from the "name" column.
data['first name'] = data['name'].str.split(',|\\.',expand = True)[2] #expand set to True to return a df instead of series
data['first name'] = data['first name'].str.strip() #strip leading and trailing white spaces
data['last name'] = data['name'].str.split(',|\\.',expand = True)[0] #expand set to True to return a df instead of series
data['last name'] = data['last name'].str.strip()
data['title'] = data['name'].str.split(',|\\.',expand = True)[1] #expand set to True to return a df instead of series
data['title'] = data['title'].str.strip()
data['title'].value_counts() #display a summary of the extracted titles
We display the titles gathered from the "name" column. Note that aside from the common titles, there are some religious titles ("Rev"), some noble titles ("Jonkheer", "Don", etc.), and some military titles ("Capt", etc.). We will sort these titles into social status groups. Our intuition is that the social status of a person had an impact on their survival, and can increase our prediction accuracy when used.
status_map={'Capt':'Military',
'Col':'Military',
'Don':'Noble',
'Dona':'Noble',
'Dr':'Dr',
'Jonkheer':'Noble',
'Lady':'Noble',
'Major':'Military',
'Master':'Common',
'Miss':'Common',
'Mlle':'Common',
'Mme':'Common',
'Mr':'Common',
'Mrs':'Common',
'Ms':'Common',
'Rev':'Clergy',
'Sir':'Noble',
'the Countess':'Noble',
}
data['social status'] = data['title'].map(status_map)
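Since map() returns NaN for any title that is not a key of status_map, a quick sanity check (a small sketch added here, not part of the original pipeline) confirms that every title was covered.
# Hedged sanity check: list any titles not covered by the status_map dictionary above.
print('Unmapped titles:', data.loc[data['social status'].isnull(), 'title'].unique())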
What about the size of the family? Would a larger family have a better chance of survival?
data['family members'] = data['parch'] + data['sibsp']
What about the deck the passenger resides on? Would a higher deck offer a better chance of escape?
data['deck'] = data['cabin'].str.replace('[0-9]', '', regex=True).str.split(' ', expand=True)[0]
#1. delete the cabin number, leaving the leading letter (the deck); 2. since some passengers have multiple cabins assigned, just take the first one, they are all on the same deck
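As a quick sanity check (a small added sketch), we can look at the distribution of the extracted deck letters, including passengers with no cabin recorded.
# Hedged sanity check: distribution of deck letters, with NaN for passengers lacking a cabin entry.
print(data['deck'].value_counts(dropna=False))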
What about the length of the name, or the length of the ticket number? Could they hold some mysterious meaning that we can't discern but a machine model can extract?
data['name length'] = data['name'].apply(lambda x: len(x))
data['ticket length'] = data['ticket'].apply(lambda x: len(x))
Let's see if any of the newly engineered features are good indicators of survival. Let's first look at social status.
data['survived'].groupby(data['social status']).mean().plot(kind='bar')
Well, it looks like if you were nobility, you escaped with high probability. On the other hand, if you were clergy, you may have stayed behind. What about family size?
data['survived'].groupby(data['family members']).mean().plot(kind='bar')
So it looks like smaller families survived at a higher rate; parents probably left with their children, while single passengers more often stayed behind. Interestingly, larger families have a lower survival rate. Perhaps they could not all fit into a single lifeboat and chose to stay behind together.
The newly engineered features are indeed good indicators of survival. We will use them as part of our model to increase accuracy. These features are easily obtained using human intuition, but are hard for a model to generate by itself.
Before we feed our features into any machine learning model, we need to fill in the missing values and make sure the data format is correct. First, let's see which columns have missing values.
data.isnull().any()
It looks like "age", "fare", "cabin", "embarked", and "deck" have missing values. We must take care filling these values. For categorical data, we can treat "not available" as its own category, since it may itself convey some information; filling in missing categorical data could generate artificial information that harms the model. For non-categorical data, we can use the mean value, but we would like to note that the entry is artificially generated, so we create another column indicating this.
data['age available'] = ~data['age'].isnull()
data['age'] = data['age'].fillna(data['age'].mean())
data['fare available'] = ~data['fare'].isnull()
data['fare'] = data['fare'].fillna(data['fare'].mean())
data['deck'] = data['deck'].fillna('NA')
data['embarked'] = data['embarked'].fillna('NA')
Check once again which columns have missing data. We will drop the "cabin" column in the next section.
data.isnull().any()
Once all missing data is filled in, we extract the columns that we will use in our model.
import numpy as np
cat_used=['survived','pclass','sex','age','sibsp','parch','fare','embarked','title','social status','family members','deck','name length','ticket length','age available', 'fare available']
data_used=pd.DataFrame()
for cat in cat_used:
    data_used[cat] = data[cat]
data_used.columns.values
We have dropped the "cabin" column along with any other column not used.
Categorical data needs to be transformed into a one-hot vector representation for the model. Simply assigning an integer to each category does not work, because the model would treat the categories as ordered numbers that can be compared and combined arithmetically. Instead, we need an indicator for each category, which is exactly what one-hot vectors provide.
categorical_data = ['sex','pclass','embarked','title','social status','deck','age available', 'fare available']
for cat in categorical_data:
    data_used = pd.concat((data_used, pd.get_dummies(data_used[cat], prefix = cat)), axis = 1)
    data_used.drop(cat, inplace=True, axis=1)
We split our data into test data and training data.
test_data_split = 0.3
msk = np.random.rand(len(data_used)) < test_data_split
test = data_used[msk]
train = data_used[~msk]
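As a side note, scikit-learn's train_test_split can produce an equivalent split. The sketch below is a hedged alternative that is not used in the rest of the tutorial; the stratify argument keeps the survival ratio the same in both halves, and random_state makes the split reproducible.
from sklearn.model_selection import train_test_split
# Hedged alternative (not used below): a reproducible, stratified train/test split.
train_alt, test_alt = train_test_split(data_used, test_size=test_data_split,
                                       stratify=data_used['survived'], random_state=42)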
We further split our data into output labels and input data.
Y_train = train['survived']
X_train = train.drop(['survived'], axis=1)
Y_test = test['survived']
X_test = test.drop(['survived'], axis=1)
For the neural network and regression models, since we are doing mathematical operations on the input data, we need to normalize the inputs to similar ranges so that no single feature dominates the calculated result. We use the MinMaxScaler() function in sklearn to scale our inputs to (0,1).
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_test_minmax = min_max_scaler.transform(X_test) # use the scaler fitted on the training data; do not refit on the test set
We are finally ready to define our machine learning models and do some prediction. We compare the prediction accuracy of several classes of binary classification models.
For our baseline, we will always predict that the passenger did not survive. We will compare the accuracy of this prediction with the predictions of our machine learning models.
Y_pred_base=np.zeros((len(Y_test)))
compare=Y_pred_base==Y_test
acc_base = sum(compare)/len(compare)
print('Base Line Test Accuracy = ',acc_base)
Our baseline prediction is pretty good, with an accuracy of 65%.
We briefly introduce decision trees.
Decision trees generate an approximate solution via greedy, top-down, recursive partitioning.
Stopping criteria we could use to determine when to halt the growth of a tree include reaching a maximum depth, having too few samples left to split (e.g., "min_samples_split") or too few samples in a leaf (e.g., "min_samples_leaf"), or reaching a node that is already pure.
The benefit of decision trees is that they can be easily visualized. At each node, we know which feature is being used to decide the split. The disadvantage is that the construction is a greedy process, so decision trees will generally not perform as well as some of the other models.
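To make the greedy criterion concrete, here is a small illustrative sketch (not scikit-learn's actual implementation) of the Gini impurity that DecisionTreeClassifier minimizes by default when scoring a candidate split; the toy labels at the end are made up for illustration.
import numpy as np

def gini(labels):
    # Gini impurity of a set of binary labels: 1 - sum_k p_k^2.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(left_labels, right_labels):
    # Weighted impurity of a candidate split; the greedy algorithm picks the
    # feature/threshold pair that minimizes this quantity at each node.
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + (len(right_labels) / n) * gini(right_labels)

# Toy example: a split that separates survivors from non-survivors well has low impurity.
print(split_impurity(np.array([1, 1, 1, 0]), np.array([0, 0, 0, 1])))
At each node, the tree evaluates candidate feature/threshold splits, keeps the one with the lowest weighted impurity, and recurses on both sides until one of the stopping criteria above is met.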
There are several hyper-parameters we can tune to improve performance. Let's list them.
from sklearn.tree import DecisionTreeClassifier
DecisionTreeClassifier().get_params().keys()
Let's scan over "min_samples_split" and "min_samples_leaf" to maximize accuracy.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
dt_classifier = DecisionTreeClassifier()
param_grid = { "min_samples_split" : [ 2,3,4,5,6,7], "min_samples_leaf": [1,2,5,10,20,30,40,50,60,70,80,90,100]}
grid_search = GridSearchCV(estimator=dt_classifier, param_grid=param_grid, scoring='accuracy', cv=3, n_jobs=-1)
grid_search = grid_search.fit(X_train, Y_train)
print(grid_search.best_score_)
print(grid_search.best_params_)
We see that our best score lies in the middle of our scan range. If the best score were at either end, we would need to widen the range and scan again. We can also scan another round at a finer granularity around the best parameters.
param_grid = { "min_samples_split" : [2], "min_samples_leaf": range(70,90)}
grid_search = GridSearchCV(estimator=dt_classifier, param_grid=param_grid, scoring='accuracy', cv=3, n_jobs=-1)
grid_search = grid_search.fit(X_train, Y_train)
print(grid_search.best_score_)
print(grid_search.best_params_)
Let's use the best parameters in our model ("min_samples_split"=2 and "min_samples_leaf"=73) and fit it on the training data. After fitting, we will use the test data to gauge the accuracy of the model.
dt_classifier = DecisionTreeClassifier(min_samples_split=2, min_samples_leaf=73)
dt_classifier.fit(X_train, Y_train)
dt_classifier.score(X_train,Y_train)
Y_pred_dt = dt_classifier.predict(X_test)
compare=Y_pred_dt==Y_test
acc_dt = sum(compare)/len(compare)
print('Decision Tree Test Accuracy = ',acc_dt)
The accuracy of the model on our test set is 81%.
Let's generate the decision tree used by our model and visualize how the features are used to make a prediction.
from io import StringIO # sklearn.externals.six was removed in newer scikit-learn versions
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(dt_classifier, out_file=dot_data,
filled=True, rounded=True,
special_characters=True,
feature_names = X_test.columns.values)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png("decision.png")
The model uses "title", an engineered feature, at the top of the tree. The title of a passenger not only includes information on sex, but also on socio-economic status. We are very happy that an engineered feature is the most useful one in the model. The second level uses "deck" and passenger class. Deck is another engineered feature, so the deck information of a passenger is indeed useful for determining survival, although it is used in an unexpected way: whether deck information is available or not.
We list the "importance" of each feature in detail below.
pd.concat((pd.DataFrame(X_train.columns, columns = ['variable']),
pd.DataFrame(dt_classifier.feature_importances_, columns = ['importance'])),
axis = 1).sort_values(by='importance', ascending = False)[:10]
As one can see, "title", passenger class, the availability of a deck assignment, the port of embarkation, and the fare are the most important features for indicating survival, with the engineered "title" feature being overwhelmingly the most important.
We can see from the decision tree that one's sex and wealth indeed played an important role in one's survival. Age is not a good indicator.
The benefit of decision trees is that they allow us to visualize and explain our findings. For other machine learning models, such as neural networks and random forests, obtaining an intuition of how the features are used is very difficult, although they may be more accurate.
In a random forest, we train many decision trees, each using only part of the data samples (and a random subset of features at each split), and combine their predictions.
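To illustrate the bagging idea, here is a minimal hedged sketch that trains a handful of decision trees on bootstrap samples and takes a majority vote. The settings (10 trees, min_samples_leaf=5) are arbitrary choices for illustration; the actual model below uses scikit-learn's RandomForestClassifier, which additionally samples a random subset of features at each split.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
# Hedged sketch of bagging: each tree is fit on a bootstrap sample (drawn with
# replacement) of the training data, and the ensemble predicts by majority vote.
rng = np.random.RandomState(0)
bagged_trees = []
for _ in range(10):
    idx = rng.randint(0, len(X_train), len(X_train))  # bootstrap sample indices
    tree = DecisionTreeClassifier(min_samples_leaf=5)
    tree.fit(X_train.iloc[idx], Y_train.iloc[idx])
    bagged_trees.append(tree)
votes = np.mean([tree.predict(X_test) for tree in bagged_trees], axis=0)
print('Bagged trees test accuracy:', np.mean((votes > 0.5) == Y_test))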
We first scan through some hyper-parameters.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(criterion= 'gini', min_samples_leaf=1, max_features='auto', oob_score=True, random_state=1, n_jobs=-1)
param_grid = { "min_samples_split" : [ 12, 14, 16], "n_estimators": [40,50, 60, 80]}
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, scoring='accuracy', cv=3, n_jobs=-1)
grid_search = grid_search.fit(X_train, Y_train)
print(grid_search.best_score_)
print(grid_search.best_params_)
Once the grid search for hyper-parameters is done, we choose the best ones and train our model.
rf_classifier = RandomForestClassifier(n_estimators=60, max_features='auto', criterion='gini', min_samples_split=14, min_samples_leaf = 1, oob_score=True, random_state=1, n_jobs=-1)
rf_classifier.fit(X_train, Y_train)
rf_classifier.oob_score_ # out-of-bag accuracy, estimated on the samples each tree did not see during training
Y_pred_rf = rf_classifier.predict(X_test)
compare=Y_pred_rf==Y_test
acc_rf = sum(compare)/len(compare)
print('Random Forest Test Accuracy = ',acc_rf)
Let's look at the importance of the features used.
pd.concat((pd.DataFrame(X_train.columns, columns = ['variable']),
pd.DataFrame(rf_classifier.feature_importances_, columns = ['importance'])),
axis = 1).sort_values(by='importance', ascending = False)[:10]
As in the decision tree model, the title is the most important feature, followed by the fare the passenger paid. Again, an engineered feature is the most important one, which shows the value of feature engineering.
Logistic regression combines features in a linear fashion to predict the outcome. Each feature is weighted differently and the weighted values are summed; the optimal weights are learned by an optimization procedure such as gradient descent.
For logistic regression, the inputs should be normalized to the same range so that no single feature dominates the weighted combination.
We will not tune any hyper-parameters for logistic regression.
from sklearn.linear_model import LogisticRegression
lr_classifierlogreg = LogisticRegression()
lr_classifierlogreg.fit(X_train_minmax, Y_train)
lr_classifierlogreg.score(X_train_minmax, Y_train)
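To connect this back to the description of logistic regression as a weighted sum, the small hedged sketch below reproduces the model's predicted probabilities by hand from the learned coefficients and intercept.
import numpy as np
# Hedged check: the predicted probability is the sigmoid of the weighted sum of
# the (scaled) features plus an intercept, using the learned coefficients.
z = X_train_minmax @ lr_classifierlogreg.coef_[0] + lr_classifierlogreg.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))
print(np.allclose(manual_proba, lr_classifierlogreg.predict_proba(X_train_minmax)[:, 1]))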
Y_pred_lr = lr_classifierlogreg.predict(X_test_minmax)
compare=Y_pred_lr==Y_test
acc_lr = sum(compare)/len(compare)
print('Logistic Regression Test Accuracy = ',acc_lr)
Our prediction accuracy is 81%.
Only the learned weights are available for us to examine, and in general gaining intuition from them is difficult, so we do not visualize the weights here.
Neural networks allow non-linear combinations of features to predict an outcome. This is achieved using non-linear activations such as "relu". Furthermore, we can stack multiple layers and re-combine the already combined features. This makes neural networks more powerful than logistic regression. However, for a simple dataset like the Titanic dataset, a neural network will not significantly outperform the other models.
Many parameters can be tuned in a neural network, including the number of layers, the size of each layer, and the activation of each layer, among others. For the sake of simplicity, we will not tune hyper-parameters here; that topic is too complicated to cover in this tutorial.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import plot_model
model = Sequential()
# Add new layers
model.add(Dense(units=20, activation='linear',input_dim=X_train_minmax.shape[1])) # input size matches the number of feature columns
model.add(Dense(units=10, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))
print(model.summary())
plot_model(model, to_file='model.png',show_shapes=True)
model.compile(optimizer='adam',loss='binary_crossentropy', metrics=['accuracy'])
The model architecture, saved to "model.png" above, can be visualized here.
history = model.fit(x=X_train_minmax,
y=Y_train,
batch_size=128,
epochs=50,
verbose=0,
validation_split=0.2,
shuffle=True)
Training a neural network model involves many iterations, or epochs. Over the epochs, the accuracy increases and the error (loss) decreases.
import matplotlib.pyplot as plt
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['acc'])      # on newer Keras versions the key is 'accuracy'
plt.plot(history.history['val_acc'])  # on newer Keras versions the key is 'val_accuracy'
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='lower right')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()
Y_pred_nn = model.predict(X_test_minmax)
compare=np.squeeze(np.round(Y_pred_nn))==np.array(Y_test)
acc_nn = sum(compare)/len(compare)
print('Neural Network Test Accuracy = ',acc_nn)
Our model accuracy is 82%.
We compare the test accuracy of our models below. For simple datasets, all models perform about equally well. In particular, a simple model such as a decision tree is able to perform well when we feed it engineered features. A programmatic sketch of the same comparison follows the table.
Model | Test Accuracy |
---|---|
Baseline | 0.6512 |
Decision Tree | 0.8102 |
Random Forest | 0.8272 |
Logistic Regression | 0.8102 |
Neural Network | 0.8215 |
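The same comparison can be assembled programmatically from the accuracy variables computed above; this is a small sketch, and the exact numbers will vary with the random train/test split.
# Hedged sketch: collect the test accuracies computed above into a single table.
results = pd.DataFrame({'Model': ['Baseline', 'Decision Tree', 'Random Forest',
                                  'Logistic Regression', 'Neural Network'],
                        'Test Accuracy': [acc_base, acc_dt, acc_rf, acc_lr, acc_nn]})
print(results.sort_values('Test Accuracy', ascending=False))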