7.1. Datasets#
7.1.1. The iris dataset#
from sklearn import datasets
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
7.1.2. The breast_cancer dataset#
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
7.1.3. The Horse Colic dataset#
The data comes from the UCI database and is loaded as follows. A ? in the file represents a missing value.
import pandas as pd
import numpy as np
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/horse-colic/horse-colic.data'
df = pd.read_csv(url, sep=r'\s+', header=None)
df = df.replace("?", np.nan)
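To check how much data is missing after the replacement, one can count the NaN values per column (a quick sanity check, not part of the original preprocessing):
print(df.isna().sum())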
The description of the data is listed here. We will preprocess the data according to the description.
The data tries to predict Column 24. Since Python indexing starts from 0, in our case we are interested in Column 23.
Columns 25-27 (in our case Columns 24-26) use a special code to represent the type of lesion. For simplicity we remove these three columns.
Column 28 (in our case Column 27) is of no significance, so we will remove it too.
Column 3 (in our case Column 2) is the ID of the hospital, which should have very little impact, so we will remove it as well.
We will fill the missing values with 0. We also would like to change the labels from 1 and 2 to 0 and 1 for the purpose of logistic regression. This part should be modified if you want to improve the performance of your model.
df.fillna(0, inplace=True)
df.drop(columns=[2, 24, 25, 26, 27], inplace=True)
df[23] = df[23].replace({1: 1, 2: 0})  # relabel 1/2 as 1/0
X = df.iloc[:, :-1].to_numpy().astype(float)
y = df[23].to_numpy().astype(int)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
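For example, instead of filling every missing value with 0, one might impute each feature column with its mean. A minimal sketch using scikit-learn's SimpleImputer, as an alternative to the df.fillna(0) line above (not necessarily better; it depends on your model):
from sklearn.impute import SimpleImputer
# impute each feature column's missing entries with the column mean
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(df.iloc[:, :-1].to_numpy().astype(float))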
7.1.4. The Dating dataset#
The data file can be downloaded from here.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('datingTestSet2.txt', sep='\t', header=None)
X = np.array(df[[0, 1, 2]])
y = np.array(df[3])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
7.1.5. Randomly generated datasets#
make_moons dataset
This is used to generate two interleaving half circles.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
make_gaussian_quantiles dataset
This generates an isotropic Gaussian distribution and labels samples by quantile.
The following code is from this page. It is used to generate a relatively complex dataset by combining two datasets together.
import numpy as np
from sklearn.datasets import make_gaussian_quantiles
from sklearn.model_selection import train_test_split
X1, y1 = make_gaussian_quantiles(cov=2.0, n_samples=200, n_features=2,
n_classes=2, random_state=1)
X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5, n_samples=300,
n_features=2, n_classes=2, random_state=1)
X = np.concatenate((X1, X2))
y = np.concatenate((y1, -y2 + 1))  # flip the labels of the second blob
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
It can also be used to generate a multiclass dataset.
from sklearn.datasets import make_gaussian_quantiles
from sklearn.model_selection import train_test_split
X, y = make_gaussian_quantiles(cov=2.0, n_samples=200, n_features=2,
n_classes=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
make_classification dataset
This will create a multiclass dataset. Without shuffling, X horizontally stacks features in the following order: the primary n_informative features, followed by n_redundant linear combinations of the informative features, followed by n_repeated duplicates, drawn randomly with replacement from the informative and redundant features. For more details please see the official documentation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_redundant=2, n_repeated=2, n_classes=3,
                           n_clusters_per_class=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
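To observe the stacking order described above, one can pass shuffle=False (a short sketch; the parameter values mirror the call above):
from sklearn.datasets import make_classification
# with shuffle=False the columns appear in the documented order:
# 2 informative, 2 redundant, 2 repeated, then 4 noise features
X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_redundant=2, n_repeated=2, n_classes=3,
                           n_clusters_per_class=1, shuffle=False)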
7.1.6. The MNIST dataset#
There are several versions of the dataset. tensorflow provides the data with the original split.
import tensorflow.keras as keras
# the original split: 60,000 training images and 10,000 test images
(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()
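scikit-learn can fetch another copy from OpenML. A minimal sketch, assuming network access (the images come back flattened to 784-dimensional rows, without the original train/test split):
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
# as_frame=False returns numpy arrays instead of a DataFrame
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)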
7.1.7. The Titanic dataset#
This is the famous Kaggle101 dataset. The original data can be downloaded from the Kaggle page. You may also download the training data and the test data by clicking the links.
import pandas as pd
dftrain = pd.read_csv('train.csv')
dftest = pd.read_csv('test.csv')
The original data is a little bit messy, with missing values and a mix of numeric and string data. The above code reads the data into DataFrames. The following code does some basic preprocessing. This part should be modified if you want to improve the performance of your model.
Only select the columns Pclass, Sex, Age, SibSp, Parch, Fare. That is to say, Name, Cabin and Embarked are dropped.
Fill the missing values in the columns Age and Fare with 0.
Replace the column Sex using the map {'male': 0, 'female': 1}.
import pandas as pd
import numpy as np

def getnp(df):
    # encode Sex numerically: male -> 0, female -> 1
    df['mapSex'] = df['Sex'].map({'male': 0, 'female': 1})
    dfx = df[['Pclass', 'mapSex', 'Age', 'SibSp', 'Parch', 'Fare']].copy()
    # fill the missing Fare and Age values with 0
    dfx['Fare'] = dfx['Fare'].fillna(0)
    dfx['Age'] = dfx['Age'].fillna(0)
    # the Kaggle test set has no Survived column
    if 'Survived' in df.columns:
        y = df['Survived'].to_numpy()
    else:
        y = None
    X = dfx.to_numpy()
    return (X, y)

X_train, y_train = getnp(dftrain)
X_test, _ = getnp(dftest)
For the purpose of submitting to Kaggle, after getting y_pred, we could use the following code to prepare the submission file.
def getdf(df, y):
    df['Survived'] = y
    return df[['PassengerId', 'Survived']]

# index=False keeps the file to exactly the two columns Kaggle expects
getdf(dftest, y_pred).to_csv('result.csv', index=False)
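Here y_pred is whatever your model predicts for X_test. As an illustration only (a hypothetical model choice, not prescribed by this section), a logistic regression could produce it like this:
from sklearn.linear_model import LogisticRegression
# train on the prepared training data, then predict the test set
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)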
7.1.8. Plant survival data with salt and microbe treatments#
This dataset is supported by a DART SEED grant. It is provided by Dr. Suresh Subedi from ATU. The dataset is about the outcomes of certain treatments applied to plants. We would like to predict whether the plants survive based on the status of the plants and the treatments. The data file can be downloaded from here.
We could use the following code to read the data.
import pandas as pd
df = pd.read_excel('assests/datasets/plants.xlsx', engine='openpyxl', sheet_name='data')
There are a few missing values. The missing values in Outcome_after 12 months all correspond to dead plants. These were not recorded as dead because the cause of death is more complicated and needs to be studied separately. In our case we can simply fill them with dead.
There are two more missing values in Stem diameter. For simplicity we drop them directly.
df['Outcome_after 12 months'] = df['Outcome_after 12 months'].fillna('dead')
df = df.dropna()
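From here one could build a feature matrix and a binary label. A minimal sketch, assuming we predict survival as Outcome_after 12 months != 'dead' and one-hot encode any categorical columns with pd.get_dummies (the exact feature set depends on your modeling choices):
from sklearn.model_selection import train_test_split
# binary label: 1 if the plant survived, 0 otherwise
y = (df['Outcome_after 12 months'] != 'dead').astype(int).to_numpy()
# features: everything except the outcome; get_dummies one-hot encodes
# the categorical (e.g. treatment) columns
X = pd.get_dummies(df.drop(columns=['Outcome_after 12 months'])).to_numpy().astype(float)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)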