4  Ensemble methods

After we get some relatively simple classifiers (sometimes also called weak classifiers), we might put them together to form a more complicated classifier. This type of methods is called an ensemble method. The basic way to ``ensemble’’ classifiers together to through the voting machine.

There are mainly two ways to generate many classifiers.

4.1 Bootstrap aggregating

4.1.1 Basic bagging

One approach to get many estimators is to use the same training algorithm for every predictor and train them on different random subsets of the training set. When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting.

Consider the following example. The dataset is the one we used in Chpater 3: make_moon. We split the dataset into training and test sets.

from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
plt.scatter(x=X[:, 0], y=X[:, 1], c=y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)

We would like to sample from the dataset to get some smaller minisets. We will use sklearn.model_selection.ShuffleSplit to perform the action.

The output of ShuffleSplit is a generator. To get the index out of it we need a for loop. You may check out the following code.

Note that ShuffleSplit is originally used to shuffle data into training and test sets. We would only use the shuffle function out of it, so we will set test_size to be 1 and use _ later in the for loop since we won’t use that part of the information.

What we finally get is a generator rs that produces indexes of subsets of X_train and y_train.

from sklearn.model_selection import ShuffleSplit
n_trees = 1000
n_instances = 100
rs = ShuffleSplit(n_splits=n_trees, test_size=1, train_size=n_instances).split(X_train)

sklearn provides BaggingClassifier to directly perform bagging or pasting. The code is as follows.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=1000,
                            max_samples=1.0,
                            bootstrap=True)

In the above code, bag_clf is a bagging classifier, made of 500 DecisionTreeClassifers, and is trained over subsets of size 1.0. The option bootstrap=True means that it is bagging. If you would like to use pasting, the option is bootstrap=False.

This bag_clf also has .fit() and .predict() methods. It is used the same as our previous classifiers. Let us try the make_moon dataset.

from sklearn.metrics import accuracy_score

bag_clf.fit(X_train, y_train)
y_pred_bag = bag_clf.predict(X_test)
accuracy_score(y_pred_bag, y_test)
0.8446666666666667

4.1.2 max_samples

The original bagging algoithm requires the “subset” has the same number of datapoints as the original set. Since duplicates are allowed, we are expected to use 63% of the data from the original dataset.

Click to expand for more details.

Let the original dataset have \(N\) elements. The probability to pick any specific data is \(1/N\), and therefore the probability not to pick it is \(1-1/N\). If we pick \(N\) elements with replacement, the probability that we never pick a specific element is \[ (1-\frac1N)^N\rightarrow \mathrm e^{-1},\quad \text{when }N\rightarrow\infty. \] Therefore when \(N\) is large, the probability for each element not to be picked is approximately \(\mathrm e^{-1}\), and then the probability for it to be picked is approximately \(1-\mathrm e^{-1}\).

Let \(I_i\) be the random variable representing whether the element \(i\) is picked or not. So \[ I_i=\begin{cases}1&i \text{ is picked}\\ 0&i\text{ is not picked}\end{cases}. \] So \[ P(I_i=1)=1-\mathrm e^{-1},\quad P(I_i=0)=\mathrm e^{-1},\quad \text{ and }E[I_i]=1\cdot(1-\mathrm e^{-1})+0\cdot(\mathrm e^{-1})=1-\mathrm e^{-1}. \] The expectation of the final sample size proportion is \[ E[\frac1N\sum_{i=1}^NI_i]=\frac1N\sum_{i=1}^IE[I_i]=1-\mathrm e^{-1}\approx 0.63. \]

With similar calculation, if we pick \(\alpha N\) data, the expectation of the final sample size proportion is \(1-\mathrm e^{-\alpha}\).

In sklearn, the requiremnt is relaxed, that you may have any number of data in each sample. This gives more flexiblity to adjust your model, that the sample size may affect the performance of the model. It is set by the argument max_samples.

  • If it is float that between 0 and 1, it will be used as the proportion. The default value is \(1.0\) which means that we use the original size as the sample size.
  • IF it is an integer that is bigger than \(1\), it will be used as the fixed sample size, no matter what the orginal data size is.

4.1.3 OOB score

When we use bagging, it is possible that some of the training data are not used. In this case, we could record which data are not used, and just use them as the test set, instead of providing extra data for test. The data that are not used is called out-of-bag instances, or oob for short. The accuracy over the oob data is called the oob score.

We could set oob_score=True to enable the function when creating a BaggingClassifier, and use .oob_score_ to get the oob score after training.

bag_clf_oob = BaggingClassifier(DecisionTreeClassifier(),
                                n_estimators=1000,
                                bootstrap=True,
                                oob_score=True)
bag_clf_oob.fit(X_train, y_train)
bag_clf_oob.oob_score_
0.8398823529411765

4.1.4 Random Forests

When the classifiers used in a bagging classifier are all Decision Trees, the bagging classifier is called a random forest. sklearn provide RandomForestClassifier class. It is almost the same as BaggingClassifier + DecisionTreeClassifer.

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=1000, random_state=1)
rnd_clf.fit(X_train, y_train)
rnd_clf.score(X_test, y_test)
0.844

When we use the Decision Tree as our base estimators, the class RandomForestClassifier provides more control over growing the random forest, with a certain optimizations. If you would like to use other estimators, then BaggingClassifier should be used.

4.1.5 Extra-trees

When growing a Decision Tree, our method is to search through all possible ways to find the best split point that get the lowest Gini impurity. Anohter method is to use a random split. Of course a random tree performs much worse, but if we use it to form a random forest, the voting system can help to increase the accuracy. On the other hand, random split is much faster than a regular Decision Tree.

This type of forest is called Extremely Randomized Trees, or Extra-Trees for short. We could modify the above random forest classifier code to implement the extra-tree algorithm. The key point is that we don’t apply the Decision Tree algorithm. Instead we perform a random split: select a random feature, and split at a random value. In sklearn there is an ExtraTreesClassifier to create such a classifier.

Note that for Extra-Tree, we use the whole dataset instead of samples, because the random split already ensures diversity so we need to reduce variance from resampling. So technically Extra-Tree is not a Bagging method in the standard way, although it looks like one.

from sklearn.ensemble import ExtraTreesClassifier

ext_clf = ExtraTreesClassifier(n_estimators=1000, random_state=1)
ext_clf.fit(X_train, y_train)
ext_clf.score(X_test, y_test)
0.8466666666666667

In the above example, RandomForestClassifier and ExtraTreesClassifier get similar accuracy. However from the code below, you will see that in this example ExtraTreesClassifier is much faster than RandomForestClassifier.

from time import time

t0 = time()
rnd_clf = RandomForestClassifier(n_estimators=1000, random_state=1)
rnd_clf.fit(X_train, y_train)
t1 = time()
print("Random Frorest: {}".format(t1 - t0))

t0 = time()
ext_clf = ExtraTreesClassifier(n_estimators=1000, random_state=1)
ext_clf.fit(X_train, y_train)
t1 = time()
print("Extremely Randomized Trees: {}".format(t1 - t0))
Random Frorest: 11.967942476272583
Extremely Randomized Trees: 5.601980447769165

4.1.6 Gini importance

After training a Decision Tree, we could look at each node. Each split is against a feature, which decrease the Gini impurity the most. In other words, we could say that the feature is the most important during the split.

Using the average Gini impurity decreased as a metric, we could measure the importance of each feature. This is called Gini importance. If the feature is useful, it tends to split mixed labeled nodes into pure single class nodes.

In the case of random forest, since there are many trees, we might compute the weighted average of the Gini importance across all trees. The weight depends on how many times the feature is used in a specific node.

Using RandomForestClassifier, we can directly get access to the Gini importance of each feature by .feature_importance_. Please see the following example.

rnd_clf.fit(X_train, y_train)
rnd_clf.feature_importances_
array([0.46353558, 0.53646442])

In this example, you may see that the two features are relavely equally important, where the second feature is slightly more important since on average it decrease the Gini impurity a little bit more.

4.2 Voting machine

4.2.1 Voting classifier

Assume that we have several trained classifiers. The easiest way to make a better classifer out of what we already have is to build a voting system. That is, each classifier give its own prediction, and it will be considered as a vote, and finally the highest vote will be the prediction of the system.

In sklearn, you may use VotingClassifier. It works as follows.

from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

clfs = [('knn', KNeighborsClassifier(n_neighbors=5)),
        ('dt', DecisionTreeClassifier(max_depth=2))]
voting_clf = VotingClassifier(estimators=clfs, voting='hard')

All classifiers are stored in the list clfs, whose elements are tuples. The syntax is very similar to Pipeline. What the classifier does is to train all listed classifiers and use the majority vote to predict the class of given test data. If each classifier has one vote, the voting method is hard. There is also a soft voting method. In this case, every classifiers not only can predict the classes of the given data, but also estimiate the probability of the given data that belongs to certain classes. On coding level, each classifier should have the predict_proba() method. In this case, the weight of each vote is determined by the probability computed. In our course we mainly use hard voting.

Let us use make_moon as an example. We first load the dataset.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)

We would like to apply kNN model. As before, we build a data pipeline pipe to first apply MinMaxScaler and then KNeighborsClassifier.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipe = Pipeline(steps=[('scalar', MinMaxScaler()),
                       ('knn', KNeighborsClassifier())])
parameters = {'knn__n_neighbors': list(range(1, 51))}
gs_knn = GridSearchCV(pipe, param_grid=parameters) 
gs_knn.fit(X_train, y_train)
clf_knn = gs_knn.best_estimator_
clf_knn.score(X_test, y_test)
0.8666666666666667

The resulted accuracy is shown above.

We then try it with the Decision Tree.

from sklearn.tree import DecisionTreeClassifier

gs_dt = GridSearchCV(DecisionTreeClassifier(), param_grid={'max_depth': list(range(1, 11)), 'max_leaf_nodes': list(range(10, 30))})
gs_dt.fit(X_train, y_train)
clf_dt = gs_dt.best_estimator_
clf_dt.score(X_test, y_test)
0.8613333333333333

We would also want to try Logistic regression method. This will be covered in the next Chapter. At current stage we just use the default setting without changing any hyperparameters.

from sklearn.linear_model import LogisticRegression
clf_lr = LogisticRegression()
clf_lr.fit(X_train, y_train)
clf_lr.score(X_test, y_test)
0.8393333333333334

Now we use a voting classifier to combine the results.

from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
clfs = [('knn', KNeighborsClassifier()),
        ('dt', DecisionTreeClassifier()),
        ('lr', LogisticRegression())]
voting_clf = VotingClassifier(estimators=clfs, voting='hard')
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)
0.85

You may compare the results of all these four classifiers. The voting classifier is not guaranteed to be better. It is just a way to form a model.

4.3 AdaBoost

This is the first algorithm that successfully implements the boosting idea. AdaBoost is short for Adaptive Boosting.

4.3.1 Weighted dataset

We firstly talk about training a Decision Tree on a weighted dataset. The idea is very simple. When building a Decision Tree, we use some method to determine the split. In this course the Gini impurity is used. There are at least two other methods: cross-entropy and misclassified rate. For all three, the count of the elemnts in some classes is the essnetial part. To train the model over the weighted dataset, we just need to upgrade the count of the elements by the weighted count.

Example 4.1 Consider the following data:

x0 x1 y Weight
1.0 2.1 + 0.5
1.0 1.1 + 0.125
1.3 1.0 - 0.125
1.0 1.0 - 0.125
2.0 1.0 + 0.125

The weighted Gini impurity is

\[ \text{WeightedGini}=1-(0.5+0.125+0.125)^2-(0.125+0.125)^2=0.375. \]

You may see that the original Gini impurity is just the weighted Gini impurity with equal weights. Therefore the first tree we get from AdaBoost (see below) is the same tree we get from the Decision Tree model in Chpater 3.

4.3.2 General process

Here is the rough description of AdaBoost.

  1. Assign weights to each data point. At the begining we could assign weights equally.
  2. Train a classifier based on the weighted dataset, and use it to predict on the training set. Find out all wrong answers.
  3. Adjust the weights, by inceasing the weights of data points that are done wrongly in the previous generation.
  4. Train a new classifier using the new weighted dataset. Predict on the training set and record the wrong answers.
  5. Repeat the above process to get many classifiers. The training stops either by hitting \(0\) error rate, or after a specific number of rounds.
  6. The final results is based on the weighted total votes from all classifiers we trained.

Now let us talk about the details. Assume there are \(N\) data points. Then the inital weights are set to be \(\dfrac1N\). There are 2 sets of weights. Let \(w^{(i)}\) be weights of the \(i\)th data points. Let \(\alpha_j\) be the weights of the \(j\)th classifier. After training the \(j\)th classifier, the error rate is denoted by \(e_j\). Then we have

\[ e_j=\frac{\text{the total weights of data points that are misclassified by the $j$th classifier}}{\text{the total weights of data points}} \]

\[ \alpha_j=\eta\ln\left(\dfrac{1-e_j}{e_j}\right). \]

\[ w^{(i)}_{\text{new}}\leftarrow\text{normalization} \leftarrow w^{(i)}\leftarrow\begin{cases}w^{(i)}&\text{if the $i$th data is correctly classified,}\\w^{(i)}\exp(\alpha_j)&\text{if the $i$th data is misclassified.}\end{cases} \]

Note

The first tree is the same tree we get from the regular Decision Tree model. In the rest of the training process, more weights are put on the data that we are wrong in the previous iteration. Therefore the process is the mimic of “learning from mistakes”.

Note

The \(\eta\) in computing \(\alpha_j\) is called the learning rate. It is a hyperparameter that will be specified mannually. It does exactly what it appears to do: alter the weights of each classifier. The default is 1.0. When the number is very small (which is recommended although it can be any positive number), more iterations will be expected.

4.3.3 Example 1: the iris dataset

Similar to all previous models, sklearn provides AdaBoostClassifier. The way to use it is similar to previous models. Note that although we are able to use any classifiers for AdaBoost, the most popular choice is Decision Tree with max_depth=1. This type of Decision Trees are also called Decision Stumps.

In the following examples, we initialize an AdaBoostClassifier with 500 Decision Stumps and learning_rate=0.5.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=1000,
                             learning_rate=.5)

We will use the iris dataset for illustration. The cross_val_score is calculated as follows.

from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
scores = cross_val_score(ada_clf, X, y, cv=5)
scores.mean()
np.float64(0.9466666666666667)

4.3.4 Example 2: the Horse Colic dataset

This dataset is from UCI Machine Learning Repository. The data is about whether horses survive if they get a disease called Colic. The dataset is preprocessed, and the result can be downloaded here. Note that based on the description, we only use the first 23 columns, and we change the original 3 outcomes (lived, died was euthanized) into 2 outcomes (lived, died) by combining outcome 2 and 3.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

filepath = "assests/datasets/horse_colic_clean.csv"
df = pd.read_csv(filepath)
X = df.iloc[:, :22].to_numpy().astype(float)
y = (df.iloc[:, 22]<2).to_numpy().astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

RANDOMSTATE = 1 
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1, random_state=RANDOMSTATE),
    random_state=RANDOMSTATE
)

# Hyperparameter grid
param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 1.0],
    "estimator__max_depth": [1, 2, 3]
}

# Grid search with cross-validation
grid = GridSearchCV(
    estimator=ada,
    param_grid=param_grid,
    cv=5,
    scoring="accuracy"
)


grid.fit(X_train, y_train)
grid.best_estimator_.score(X_test, y_test)
0.6964285714285714
grid.best_params_
{'estimator__max_depth': 2, 'learning_rate': 0.01, 'n_estimators': 200}

4.4 Exercises

Exercise 4.1 CHOOSE ONE: Please apply the random forest to one of the following datasets.

  • the iris dataset.
  • the dating dataset.
  • the titanic dataset.

Please answer the following questions.

  1. Please use grid search to find the good max_leaf_nodes and max_depth.
  2. Please record the cross-validation score and the OOB score of your model and compare it with the models you learned before (kNN, Decision Trees).
  3. Please find some typical features (using the Gini importance) and draw the Decision Boundary against the features you choose.

Exercise 4.2 Please use the following code to get the mgq dataset.

from sklearn.datasets import make_gaussian_quantiles

X1, y1 = make_gaussian_quantiles(cov=2.0, n_samples=200, n_features=2,
                                 n_classes=2, random_state=1)
X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5, n_samples=300,
                                 n_features=2, n_classes=2, random_state=1)
X = np.concatenate((X1, X2))
y = np.concatenate((y1, -y2 + 1))

Please build an AdaBoost model.

Exercise 4.3 Please use RandomForestClassifier, ExtraTreesClassifier and KNeighbourClassifier to form a voting classifier, and apply to the MNIST dataset.

NoteMNIST

This dataset can be loaded using the following code.

import numpy as np
import requests
from io import BytesIO
r = requests.get('https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz', stream = True) 
data = np.load(BytesIO(r.raw.read()))
X_train = data['x_train']
X_test = data['x_test']
y_train = data['y_train']
y_test = data['y_test']