5 Horse Colic Dataset

The data is from the UCI database. The data is loaded as follows. ? represents missing data.

import pandas as pd
import numpy as np

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/horse-colic/horse-colic.data'
df = pd.read_csv(url, delim_whitespace=True, header=None)
df = df.replace("?", np.NaN)

The description of the data is listed here. We will preprocess the data according to the descrption.

The data tries to predict Column 24. Since Python index starts from 0, in our case we are interested in Column 23.
Column 25-27 (in our case is Column 24-26) use a special code to represent the type of lesion. For simplicity we remove these three columns.
Column 28 (in our case Column 27) is of no significance so we will remove it too.
Column 3 (in our case Column 2) is the IDs of Hospitals which should have very little impact so we will remove it too.
We will fill the missing values with 0.
We also would like to change the label from 1 and 2 to 0 and 1 for the purpose of Logistic regression.

This part should be modified if you want to improve the performance of your model.

df.fillna(0, inplace=True)
df.drop(columns=[2, 24, 25, 26, 27], inplace=True)
df[23].replace({1: 1, 2: 0}, inplace=True)
X = df.iloc[:, :-1].to_numpy().astype(float)
y = df[23].to_numpy().astype(int)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)