1.2. Basic setting for Machine Learning problems

Note

By default, we assume that we are dealing with a Supervised Classification problem.

1.2.1. Input and output data structure

Since we are dealing with Supervised Classification problems, the desired solutions are given. In Classification problems these desired solutions are also called labels. The properties used to describe each data point are called features. Both features and labels are usually organized as row vectors.

Example 1.1

The example is taken from [3]. Some sample data are shown in the following table. We would like to use this information to classify bird species.

Table 1.1 Bird species classification based on four features

| Weight (g) | Wingspan (cm) | Webbed feet? | Back color | Species                  |
|------------|---------------|--------------|------------|--------------------------|
| 1000.1     | 125.0         | No           | Brown      | Buteo jamaicensis        |
| 3000.7     | 200.0         | No           | Gray       | Sagittarius serpentarius |
| 3300.0     | 220.3         | No           | Gray       | Sagittarius serpentarius |
| 4100.0     | 136.0         | Yes          | Black      | Gavia immer              |
| 3.0        | 11.0          | No           | Green      | Calothorax lucifer       |
| 570.0      | 75.0          | No           | Black      | Campephilus principalis  |

The first four columns are features, and the last column is the label. The first two features are numeric and can take on decimal values. The third feature is binary: it can only be \(1\) (Yes) or \(0\) (No). The fourth feature is an enumeration over the color palette. You may treat it either as categorical data or as numeric data, depending on how you want to build the model and what you want to get out of the data. In this example we treat it as categorical data whose value is chosen from a fixed list of colors (\(1\) for Brown, \(2\) for Gray, \(3\) for Black, \(4\) for Green).

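Encoding each species as an integer in order of first appearance (\(1\) for Buteo jamaicensis, \(2\) for Sagittarius serpentarius, \(3\) for Gavia immer, \(4\) for Calothorax lucifer, \(5\) for Campephilus principalis), we are able to transform the above data into the following form: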

Table 1.2 Vectorized bird species data

| Features                                                  | Labels |
|-----------------------------------------------------------|--------|
| \(\begin{bmatrix}1000.1 & 125.0 & 0 & 1 \end{bmatrix}\)   | \(1\)  |
| \(\begin{bmatrix}3000.7 & 200.0 & 0 & 2 \end{bmatrix}\)   | \(2\)  |
| \(\begin{bmatrix}3300.0 & 220.3 & 0 & 2 \end{bmatrix}\)   | \(2\)  |
| \(\begin{bmatrix}4100.0 & 136.0 & 1 & 3 \end{bmatrix}\)   | \(3\)  |
| \(\begin{bmatrix}3.0 & 11.0 & 0 & 4 \end{bmatrix}\)       | \(4\)  |
| \(\begin{bmatrix}570.0 & 75.0 & 0 & 3 \end{bmatrix}\)     | \(5\)  |

Then the Supervised Learning problem is stated as follows: Given the features and the labels, we would like to find a model that can classify future data.
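The encoding from Table 1.1 to feature vectors and labels can be sketched in a few lines of plain Python. This is only an illustration: the names `raw_rows`, `COLOR_CODES` and `encode` are made up for this sketch, and numbering the species in order of first appearance is an assumption read off the label column of the vectorized table.

```python
# Rows of Table 1.1: weight (g), wingspan (cm), webbed feet?, back color, species
raw_rows = [
    (1000.1, 125.0, "No", "Brown", "Buteo jamaicensis"),
    (3000.7, 200.0, "No", "Gray", "Sagittarius serpentarius"),
    (3300.0, 220.3, "No", "Gray", "Sagittarius serpentarius"),
    (4100.0, 136.0, "Yes", "Black", "Gavia immer"),
    (3.0, 11.0, "No", "Green", "Calothorax lucifer"),
    (570.0, 75.0, "No", "Black", "Campephilus principalis"),
]

# Categorical encoding of the back color, as chosen in the text
COLOR_CODES = {"Brown": 1, "Gray": 2, "Black": 3, "Green": 4}


def encode(rows):
    """Turn raw table rows into numeric feature vectors and integer labels."""
    species_codes = {}  # species name -> integer label (assumed: order of first appearance)
    features, labels = [], []
    for weight, wingspan, webbed, color, species in rows:
        # setdefault keeps an existing code, or assigns the next unused integer
        code = species_codes.setdefault(species, len(species_codes) + 1)
        webbed_flag = 1 if webbed == "Yes" else 0  # binary feature
        features.append([weight, wingspan, webbed_flag, COLOR_CODES[color]])
        labels.append(code)
    return features, labels, species_codes


features, labels, species_codes = encode(raw_rows)
for row, label in zip(features, labels):
    print(row, label)
```

Each printed line pairs one feature row vector with its label, mirroring one row of the vectorized table.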

1.2.2. Parameters and hyperparameters

A model parameter is internal to the model and its value is learned from the data.

A model hyperparameter is external to the model and its value is set by people.

For example, assume that we would like to use Logistic regression to fit the data. We set the learning rate to 0.1 and the maximum number of iterations to 100. After the computations are done, we get the model

\[ y = \sigma(0.8+0.7x). \]

The two coefficients \(0.8\) and \(0.7\) are the parameters of the model. The choice of Logistic regression as the model, the learning rate 0.1 and the maximum number of iterations 100 are all hyperparameters. If we change to a different set of hyperparameters, we may get a different model, with a different set of parameters.

The details of Logistic regression will be discussed in Chapter 5.
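To make the distinction concrete, here is a minimal sketch of a one-feature logistic regression \(y = \sigma(b + wx)\) fitted by plain gradient descent. Several things here are assumptions for illustration only: the toy data are made up, the function name `fit_logistic` is hypothetical, and plain gradient descent on the average log-loss is just one possible training procedure. The learned values will not be the \(0.8\) and \(0.7\) quoted above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, max_iter=100):
    """Fit y = sigmoid(b + w * x) by gradient descent on the average log-loss.

    learning_rate and max_iter are hyperparameters: we choose them.
    b and w are parameters: they are learned from the data.
    """
    b, w = 0.0, 0.0
    n = len(xs)
    for _ in range(max_iter):
        errors = [sigmoid(b + w * x) - y for x, y in zip(xs, ys)]
        grad_b = sum(errors) / n
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * grad_b
        w -= learning_rate * grad_w
    return b, w


# Hypothetical toy data: one numeric feature, binary labels
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

# Same data, two different sets of hyperparameters
b1, w1 = fit_logistic(xs, ys, learning_rate=0.1, max_iter=100)
b2, w2 = fit_logistic(xs, ys, learning_rate=0.5, max_iter=500)
print("learning_rate=0.1, max_iter=100:", b1, w1)
print("learning_rate=0.5, max_iter=500:", b2, w2)
```

Running the two calls on the same data gives two different pairs \((b, w)\): changing the hyperparameters changes the learned parameters.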

1.2.3. Evaluate a Machine Learning model

Once the model is built, how do we know whether it is good or not? The naive idea is to test the model on some brand new data and check whether it produces the desired results. The usual way to achieve this is to split the input dataset into three pieces: training set, validation set and test set.

The model is initially fit on the training set, with some arbitrary selection of hyperparameters. Then the hyperparameters are changed, and a new model is fitted on the training set. Which set of hyperparameters is better? We compare their performance on the validation set. We can run through many different combinations of hyperparameters and find the one with the best performance on the validation set. Once we have the best hyperparameters, the model is selected, and we fit it on the training set to get the model we will use.

To compare our model with other models, either our own models using other algorithms or models built by others, we need some new data. We can no longer use the training set and the validation set, since all data in them have already been used, either for training or for hyperparameter tuning. We need to use the test set to evaluate the “real performance” of our model.

To summarize:

  • Training set: used to fit the model;

  • Validation set: used to tune the hyperparameters;

  • Test set: used to check the overall performance of the model.
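The roles of the three sets can be sketched with a tiny example. The model below is a 1-D k-nearest-neighbors classifier, chosen only because it has a single obvious hyperparameter \(k\); the data, the candidate values and the helper names `knn_predict` and `accuracy` are all hypothetical.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (labels are 0 or 1)."""
    neighbors = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return 1 if 2 * votes > k else 0


def accuracy(train, data, k):
    """Fraction of points in `data` classified correctly using `train` as the model."""
    correct = sum(knn_predict(train, x, k) == y for x, y in data)
    return correct / len(data)


# Hypothetical toy data: (feature, label) pairs, already split into three sets
train = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (2.6, 1),
         (3.0, 0), (3.5, 1), (4.0, 1), (4.5, 1), (5.0, 1)]
validation = [(1.2, 0), (2.4, 0), (3.2, 1), (4.2, 1)]
test = [(0.8, 0), (3.8, 1), (4.8, 1)]

# Validation set: compare hyperparameter candidates and keep the best one
candidates = [1, 3, 5]
best_k = max(candidates, key=lambda k: accuracy(train, validation, k))

# Test set: used once, at the end, to report the performance of the selected model
test_accuracy = accuracy(train, test, best_k)
print("best k:", best_k, "test accuracy:", test_accuracy)
```

Note that the test set plays no part in choosing `best_k`; it is touched only after the selection is finished.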

The validation set is not always required. If we use a cross-validation technique for hyperparameter tuning, such as sklearn.model_selection.GridSearchCV(), we don’t need a separate validation set. In this case, we only need the training set and the test set, and we run GridSearchCV over the training set. Cross-validation will be discussed in Section 2.2.6.
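As a rough picture of why cross-validation removes the need for a separate validation set, the sketch below cuts the training set into folds and lets each fold act as the validation set once. It is not how GridSearchCV is implemented; the function name `k_fold_splits` and the choice of contiguous, unshuffled folds are assumptions made for illustration.

```python
def k_fold_splits(n_samples, n_folds):
    """Yield (train_indices, validation_indices) pairs, one pair per fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, validation_idx
        start += size


# Ten training samples, three folds: every sample is used for validation exactly once
splits = list(k_fold_splits(10, 3))
for train_idx, validation_idx in splits:
    print("train:", train_idx, "validation:", validation_idx)
```

A hyperparameter setting would then be scored by averaging its validation performance over all folds.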

The sizes and strategies for dataset division depend on the problem and the data available. It is often recommended that most of the data be used for training. Typical ratios of training, validation and test sets are \((6:3:1)\), \((7:2:1)\) or \((8:1:1)\). Sometimes the validation set is dropped and only the training set and test set are used. In this case the ratio of training set to test set is usually \((7:3)\), \((8:2)\) or \((9:1)\).
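A split with one of these ratios can be sketched using only the standard library. The function name `split_dataset`, its default ratio \((7:2:1)\) and the fixed random seed are assumptions for illustration; shuffling before splitting is a common but not universal choice (it would be wrong for time-ordered data, for instance).

```python
import random


def split_dataset(samples, ratios=(7, 2, 1), seed=0):
    """Shuffle the samples and cut them into training, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    total = sum(ratios)
    n = len(shuffled)
    n_train = n * ratios[0] // total
    n_validation = n * ratios[1] // total
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # the test set takes whatever remains
    return train, validation, test


train, validation, test = split_dataset(range(100), ratios=(7, 2, 1))
print(len(train), len(validation), len(test))
```

Passing `ratios=(8, 0, 2)` gives an empty validation set, which corresponds to the two-set case described above.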

1.2.4. Workflow in developing a machine learning application

The workflow described below is from [3].

  1. Collect data.

  2. Prepare the input data.

  3. Analyze the input data.

  4. Train the algorithm.

  5. Test the algorithm.

  6. Use it.

In this course, we will mainly focus on Step 4 as well as Step 5. These two steps are where the “core” algorithms lie, depending on the algorithm. Starting from the next chapter, we will discuss various Machine Learning algorithms and examples.