5 Horse Colic Dataset
The data comes from the UCI Machine Learning Repository. It is loaded as follows, with ? representing missing data.
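A minimal sketch of such a loading step is given here. It assumes that the raw file is whitespace-separated and has no header row. The two rows are made up for illustration and shortened to five columns, and the name df_demo is hypothetical. In practice you would pass the path of the downloaded data file to pd.read_csv instead of the in-memory buffer.

```python
import io

import pandas as pd

# Stand-in for the raw data file: made-up rows, whitespace-separated,
# no header row, with "?" marking missing entries.
raw = io.StringIO(
    "2 1 530101 38.5 ?\n"
    "1 1 534817 ? 88\n"
)

# header=None gives integer column labels 0, 1, 2, ...
# na_values="?" converts every "?" into NaN while parsing.
df_demo = pd.read_csv(raw, sep=r"\s+", header=None, na_values="?")

# Count the missing entries that were recognized.
n_missing = int(df_demo.isna().sum().sum())
```

Because the columns get integer labels starting at 0, Column k in the data description corresponds to label k - 1 in the DataFrame.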
The description of the data is listed here. We will preprocess the data according to that description.
- The task is to predict Column 24. Since Python indexing starts from 0, this is Column 23 in our case.
- Columns 25-27 (Columns 24-26 in our case) use a special code to represent the type of lesion. For simplicity we remove these three columns.
- Column 28 (Column 27 in our case) is of no significance, so we remove it too.
- Column 3 (Column 2 in our case) contains the hospital IDs, which should have very little impact, so we remove it too.
- We will fill the missing values with 0.
- We also recode the label so that it takes the values 0 and 1 instead of 1 and 2 (1 stays 1 and 2 becomes 0), as needed for logistic regression.
This part should be modified if you want to improve the performance of your model.
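# Fill missing values (NaN) with 0.
df.fillna(0, inplace=True)
# Drop the hospital ID (2), the lesion-type codes (24-26), and column 27.
df.drop(columns=[2, 24, 25, 26, 27], inplace=True)
# Recode the label: 1 stays 1, 2 becomes 0. Assigning the result back avoids
# the chained inplace call, which newer pandas versions warn about.
df[23] = df[23].replace({1: 1, 2: 0})
# Column 23 is now the last column, so every column before it is a feature.
X = df.iloc[:, :-1].to_numpy().astype(float)
y = df[23].to_numpy().astype(int)
from sklearn.model_selection import train_test_split
# Hold out 15% of the samples as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)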
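As one possible modification of the preprocessing above, the missing values could be filled with each feature's median instead of 0, with the medians computed on the training split only. The following is a sketch under assumptions, not part of the original pipeline: the toy array and all names ending in _toy are made up for illustration, and using SimpleImputer from scikit-learn is a suggestion.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy feature matrix with missing entries (NaN); the values are made up.
X_toy = np.array([
    [38.5, 66.0], [np.nan, 88.0], [39.2, np.nan], [38.3, 40.0],
    [37.3, 104.0], [np.nan, 48.0], [37.9, 60.0], [38.1, np.nan],
])
y_toy = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Split first, so that the test rows do not influence the fill values.
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# Learn the per-column medians from the training rows only,
# then apply the same medians to the test rows.
imputer = SimpleImputer(strategy="median")
X_train_filled = imputer.fit_transform(X_train_toy)
X_test_filled = imputer.transform(X_test_toy)
```

To use this idea on the real data, the fillna(0) step would have to be skipped so that the missing entries are still NaN when the imputer sees them.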