1 Adult income dataset
This dataset comes from the UCI Machine Learning Repository. Each row describes one individual, and the purpose of the project is to predict whether that individual's annual income exceeds $50K.
1.1 Reading data
- There are two dataset files: adult.data is the training dataset and adult.test is the test dataset.
- Neither file has a header line, so we need to specify the column names manually.
- The first line of adult.test should be skipped since it is not a data row.
- In this dataset missing values are marked as ?. We remove every row that contains a missing value.
import pandas as pd
# Fields are separated by ", ", so a missing value is read as " ?" (with a leading space).
train_df = pd.read_csv("assests/datasets/adult/adult.data", na_values=" ?", header=None).dropna()
# skiprows=[0] drops the first line of adult.test, which is not a data row.
test_df = pd.read_csv("assests/datasets/adult/adult.test", na_values=" ?", skiprows=[0], header=None).dropna()
colnames = [
    "age",
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country",
    "income",
]
train_df.columns = colnames
test_df.columns = colnames
1.2 Preprocessing
There are 15 columns. The last column is income, which has two string values, ' >50K' and ' <=50K'. The remaining variables are described below.
- age: continuous
- workclass: 8 categories
- fnlwgt: continuous
- education: 16 categories
- education-num: continuous
- marital-status: 7 categories
- occupation: 14 categories
- relationship: 6 categories
- race: 5 categories
- sex: 2 categories
- capital-gain: continuous
- capital-loss: continuous
- hours-per-week: continuous
- native-country: 41 categories
So for the features, we need to normalize the numerical columns and apply one-hot encoding to the categorical columns. For the labels, we need to convert ' >50K' and ' <=50K' into 1 and 0, respectively.
1.2.1 Split the label column
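No code accompanies this step in the text, so the following is only a minimal sketch on a hand-made toy frame. It assumes the label column is named income as above; the helper name split_label is hypothetical. One thing worth checking against your own copy of the files: in adult.test the labels are commonly written with a trailing period (' >50K.'), so the sketch strips surrounding whitespace and a trailing period before comparing.

```python
import pandas as pd


def split_label(df, label_col="income"):
    """Return (features, labels); labels are 1 for '>50K' and 0 otherwise.

    Hypothetical helper, not part of the original notes.
    """
    X = df.drop(columns=[label_col])
    # Normalize ' >50K', ' >50K.' and similar spellings to '>50K'.
    cleaned = df[label_col].str.strip().str.rstrip(".")
    y = (cleaned == ">50K").astype(int).to_numpy()
    return X, y


# Toy rows standing in for the real data.
toy = pd.DataFrame(
    {
        "age": [39, 50, 38],
        "sex": [" Male", " Female", " Male"],
        "income": [" <=50K", " >50K", " >50K."],
    }
)
X_toy, y_toy = split_label(toy)
```

On the real data this would presumably be called as X_train, y_train = split_label(train_df), and likewise for test_df.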
1.2.2 Normalize
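The text gives no code for this step either. A minimal sketch, assuming standardization (zero mean, unit variance) via scikit-learn's StandardScaler is what is meant by "normalize"; the toy frames and the names numeric_train and numeric_test are illustrative assumptions. The scaler is fitted on the training data only and then reused on the test data, so no test statistics leak into training.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy subset of the continuous columns; the full list would be
# age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week.
numeric_features = ["age", "hours-per-week"]

X_train_toy = pd.DataFrame(
    {"age": [20.0, 30.0, 40.0], "hours-per-week": [30.0, 40.0, 50.0]}
)
X_test_toy = pd.DataFrame({"age": [30.0], "hours-per-week": [60.0]})

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training data.
numeric_train = scaler.fit_transform(X_train_toy[numeric_features])
# transform reuses the training statistics on the test data.
numeric_test = scaler.transform(X_test_toy[numeric_features])
```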
1.2.3 One-hot encoding
from sklearn.preprocessing import OneHotEncoder
categorical_features = [
"workclass",
"education",
"marital-status",
"occupation",
"relationship",
"race",
"sex",
"native-country",
]
ohe = OneHotEncoder(handle_unknown="ignore")
categorical_X_train = ohe.fit_transform(X_train[categorical_features])
categorical_X_test = ohe.transform(X_test[categorical_features])
The one-hot encoding of the categorical columns is relatively large and is therefore stored as a sparse matrix (an object different from a NumPy array). So we need to convert it back to a NumPy array.
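The conversion itself is not shown, so here is a small sketch on toy data. It assumes the encoder returns a SciPy sparse matrix (the default for OneHotEncoder), which can be turned into a dense NumPy array with its toarray() method.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column standing in for the real features.
toy = pd.DataFrame({"sex": [" Male", " Female", " Male"]})

ohe = OneHotEncoder(handle_unknown="ignore")
encoded = ohe.fit_transform(toy[["sex"]])  # SciPy sparse matrix by default
dense = encoded.toarray()                  # dense NumPy array, shape (3, 2)
```

For the data above the same call would presumably be categorical_X_train.toarray() and categorical_X_test.toarray().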
1.2.4 Concatenate
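This step is also left without code. A minimal sketch, assuming the normalized numerical columns and the dense one-hot columns are both NumPy arrays with one row per sample; the array names and values here are made up for illustration.

```python
import numpy as np

# Two samples: 2 normalized numeric columns and 3 one-hot columns.
numeric_part = np.array([[0.5, -1.0], [1.5, 0.0]])
categorical_part = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

# Stack column-wise so every row holds all features of one sample.
X_combined = np.hstack([numeric_part, categorical_part])
```

The row counts of the two blocks must match; only the number of columns grows.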
1.2.5 Composer
The best practice for the above operations is to combine them in a composer, here scikit-learn's ColumnTransformer.
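from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Continuous columns from the variable list above; the other features are categorical.
numeric_features = [
    "age", "fnlwgt", "education-num",
    "capital-gain", "capital-loss", "hours-per-week",
]
preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)
# Fit on the training features only, then reuse the fitted transformer on the test features.
X_train_norm = preprocess.fit_transform(X_train)
X_test_norm = preprocess.transform(X_test)
The result here is also a sparse matrix, so we need to convert it back to a dense array.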
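A hedged sketch of that conversion on toy data follows. The helper name to_dense is hypothetical. ColumnTransformer returns a sparse matrix only when the stacked output is sparse enough (controlled by its sparse_threshold parameter), so the helper accepts both sparse and dense results.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def to_dense(matrix):
    """Return a dense NumPy array for either a sparse or a dense input."""
    return matrix.toarray() if sparse.issparse(matrix) else np.asarray(matrix)


# Toy frame with one numeric and one categorical column.
toy = pd.DataFrame(
    {
        "age": [20.0, 30.0, 40.0, 50.0],
        "occupation": [" Sales", " Tech-support", " Sales", " Craft-repair"],
    }
)
preprocess_toy = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["occupation"]),
    ]
)
X_toy = to_dense(preprocess_toy.fit_transform(toy))
```

An alternative is to pass sparse_threshold=0 to ColumnTransformer, which asks it to return a dense array directly.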
1.3 Dataset class
We build a Dataset class so that the data can be fed to a neural network.
import torch
from torch.utils.data import Dataset
class AdultDataset(Dataset):
    def __init__(self, X, y):
        # X is expected to be a dense array, not a sparse matrix.
        self.X = torch.tensor(X, dtype=torch.float32)
        # Labels are stored as a column vector of shape (n_samples, 1).
        self.y = torch.tensor(y, dtype=torch.long).reshape(-1, 1)
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
train_ds = AdultDataset(X_train_norm, y_train)
test_ds = AdultDataset(X_test_norm, y_test)
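To show how such a dataset is typically consumed, here is a small sketch that wraps it in a PyTorch DataLoader. The random toy arrays, the batch size of 4 and the variable names are assumptions for illustration only; the class is repeated so the snippet runs on its own.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class AdultDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.long).reshape(-1, 1)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


# Toy data: 10 samples with 4 dense features and binary labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(10, 4))
y_toy = np.array([0, 1] * 5)

toy_ds = AdultDataset(X_toy, y_toy)
# The loader groups samples into mini-batches; the last batch may be smaller.
loader = DataLoader(toy_ds, batch_size=4, shuffle=False)
xb, yb = next(iter(loader))
```

Each iteration yields a feature batch of shape (batch_size, n_features) and a label batch of shape (batch_size, 1).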