5 Horse Colic Dataset
The data comes from the UCI Machine Learning Repository and is loaded as follows. In the raw file, ? represents a missing value.
The description of the data is listed here. We will preprocess the data according to the description.
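A minimal loading sketch, assuming the raw file is whitespace-delimited with no header row. The helper name `load_horse_colic`, the local path in the comment, and the tiny made-up sample are illustrative assumptions, not part of the original notes.

```python
import io
import pandas as pd

def load_horse_colic(source):
    """Read a whitespace-delimited file with no header; '?' is parsed as NaN."""
    # header=None makes pandas label the columns 0, 1, 2, ... (0-based).
    return pd.read_csv(source, sep=r"\s+", header=None, na_values="?")

# Tiny made-up sample (NOT real data), only to show how '?' is handled.
sample = io.StringIO("1 2 100 38.5 ?\n2 ? 200 39.0 88\n")
demo = load_horse_colic(sample)
print(demo.isna().sum().sum())  # number of missing entries in the sample

# With the real file, assuming it was downloaded to a hypothetical local path:
# df = load_horse_colic("horse-colic.data")
```

Reading ? as NaN at load time is what lets a later `fillna` call find the missing entries.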
- The target to predict is Column 24. Since Python indexing starts from 0, in our case this is Column 23.
- Columns 25-27 (in our case Columns 24-26) use a special code to represent the type of lesion. For simplicity we remove these three columns.
- Column 28 (in our case Column 27) is of no significance so we will remove it too.
- Column 3 (in our case Column 2) holds the hospital IDs, which should have very little impact, so we will remove it too.
- We will fill the missing values with 0.
- We also would like to change the label from 1 and 2 to 1 and 0 (1 stays 1, 2 becomes 0) for the purpose of logistic regression.
This part should be modified if you want to improve the performance of your model.
df.fillna(0, inplace=True)
df.drop(columns=[2, 24, 25, 26, 27], inplace=True)
df[23] = df[23].replace({1: 1, 2: 0})  # assign back: in-place replace on a selected column may not update df
X = df.iloc[:, :-1].to_numpy().astype(float)
y = df[23].to_numpy().astype(int)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
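As one possible modification of the fill-with-0 step, the sketch below leaves missing values as NaN and imputes them with the training median inside a pipeline before fitting logistic regression. It is only an illustration under stated assumptions: `X_demo` and `y_demo` are random synthetic stand-ins (not the horse colic data), and the choice of median imputation, scaling, and `random_state=0` is not from the original notes.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (random numbers, NOT the horse colic data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # blank out roughly 10% of entries

# Fixed seed for a reproducible split; stratify keeps the class balance similar.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=0, stratify=y_demo
)

# The imputer learns medians from the training split only and reuses them on the test split.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(f"test accuracy: {test_accuracy:.2f}")
```

Keeping the imputation inside the pipeline means the test split never influences the fill values, which filling the whole DataFrame before splitting cannot guarantee for data-dependent strategies such as the median.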