13 titanic dataset
This is the famous Kaggle101 dataset. The original data can be downloaded from the Kaggle page. You may also download the training set and the test set by clicking the links.
The original data is a little messy, with missing values and a mix of numeric and string data. The above code reads the data into a DataFrame. The following code does some basic preprocessing. This part should be modified if you want to improve the performance of your model.
- Only select the columns Pclass, Sex, Age, SibSp, Parch and Fare. That is to say, Name, Cabin and Embarked are dropped.
- Fill the missing values in the columns Age and Fare with 0.
- Replace the column Sex by the following map: {'male': 0, 'female': 1}.
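import pandas as pd
import numpy as np

def getnp(df):
    # Encode Sex as an integer column: male -> 0, female -> 1.
    df['mapSex'] = df['Sex'].map({'male': 0, 'female': 1})
    # Keep only the selected feature columns, on a copy.
    dfx = df[['Pclass', 'mapSex', 'Age', 'SibSp', 'Parch', 'Fare']].copy()
    # Fill missing values with 0. Assigning the result back avoids the
    # chained inplace fillna, which may not update dfx in recent pandas.
    dfx['Fare'] = dfx['Fare'].fillna(0)
    dfx['Age'] = dfx['Age'].fillna(0)
    # The test set has no Survived column, so y is None in that case.
    if 'Survived' in df.columns:
        y = df['Survived'].to_numpy()
    else:
        y = None
    X = dfx.to_numpy()
    return (X, y)

X_train, y_train = getnp(dftrain)
X_test, _ = getnp(dftest)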
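As noted earlier, the preprocessing step is the natural place to experiment. The sketch below shows one possible modification, not the approach used in these notes: filling missing Age and Fare values with the median instead of 0. The function name getnp_median, its age_fill and fare_fill parameters, and the toy DataFrame are all hypothetical and only serve the illustration.

```python
import pandas as pd

def getnp_median(df, age_fill=None, fare_fill=None):
    """Hypothetical variant of getnp: impute Age and Fare with medians instead of 0."""
    dfx = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']].copy()
    dfx['Sex'] = dfx['Sex'].map({'male': 0, 'female': 1})
    # Use the supplied fill values when given (e.g. medians computed on the
    # training set); otherwise fall back to the medians of this DataFrame.
    if age_fill is None:
        age_fill = dfx['Age'].median()
    if fare_fill is None:
        fare_fill = dfx['Fare'].median()
    dfx['Age'] = dfx['Age'].fillna(age_fill)
    dfx['Fare'] = dfx['Fare'].fillna(fare_fill)
    y = df['Survived'].to_numpy() if 'Survived' in df.columns else None
    return dfx.to_numpy(), y

# Tiny made-up frame (not real Titanic rows), only to show the behaviour.
toy = pd.DataFrame({
    'Pclass': [3, 1, 2],
    'Sex': ['male', 'female', 'male'],
    'Age': [22.0, float('nan'), 30.0],
    'SibSp': [1, 0, 0],
    'Parch': [0, 0, 1],
    'Fare': [7.25, 71.28, float('nan')],
    'Survived': [0, 1, 0],
})
X_toy, y_toy = getnp_median(toy)
```

When applying this to the real data, it is usually preferable to compute the fill values once on the training set and pass them in for the test set, so that both sets are imputed consistently.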
For the purpose of submitting to Kaggle, after getting y_pred, we could use the following code to prepare the submission file.
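A minimal sketch of such a step is given here, under stated assumptions: the Titanic competition expects a CSV file with the two columns PassengerId and Survived, and the test DataFrame still carries its PassengerId column. The helper name make_submission, the file names, and the demo ids and predictions are hypothetical and are not taken from these notes.

```python
import numpy as np
import pandas as pd

def make_submission(passenger_ids, y_pred, path='submission.csv'):
    """Hypothetical helper: write PassengerId/Survived pairs to a CSV file."""
    sub = pd.DataFrame({
        'PassengerId': np.asarray(passenger_ids),
        # Kaggle expects 0/1 labels, so cast predictions to int.
        'Survived': np.asarray(y_pred).astype(int),
    })
    # index=False keeps the row index out of the file.
    sub.to_csv(path, index=False)
    return sub

# Made-up ids and predictions, only to show the resulting layout.
demo = make_submission([892, 893, 894], [0, 1, 0], path='submission_demo.csv')
```

With the real data, the call would look like make_submission(dftest['PassengerId'], y_pred), assuming y_pred holds one prediction per row of dftest in the same order.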