from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
plt.scatter(x=X[:, 0], y=X[:, 1], c=y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
4 Ensemble methods
After we get some relatively simple classifiers (sometimes also called weak classifiers), we may put them together to form a more complicated classifier. This type of method is called an ensemble method. The basic way to "ensemble" classifiers together is through a voting mechanism.
There are mainly two ways to generate many classifiers.
bagging
: This is also called bootstrap aggregating. The idea is:
- First we randomly pick samples from the original dataset to form a bunch of new training datasets;
- Then we apply the same learning method to those training datasets to get a bunch of classifiers;
- Finally we apply all these classifiers to the data we are interested in and use the most frequent class as the result.
boosting
: We train a bunch of classifiers sequentially. We assign weights to each classifier and change the weights adaptively according to the results of the current combination.
4.1 Bootstrap aggregating
4.1.1 Basic bagging
One approach to get many estimators is to use the same training algorithm for every predictor and train them on different random subsets of the training set. When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting.
Consider the following example. The dataset is the one we used in Chapter 3: make_moons. We split the dataset into training and test sets.
We would like to sample from the dataset to get some smaller minisets. We will use sklearn.model_selection.ShuffleSplit to perform the action. The output of ShuffleSplit is a generator. To get the indices out of it we need a for loop. You may check out the following code.

Note that ShuffleSplit is originally used to shuffle data and split it into training and test sets. We only need the shuffling part, so we will set test_size to be 1 and use _ later in the for loop since we won't use that part of the information.

What we finally get is a generator rs that produces indices of subsets of X_train and y_train.
from sklearn.model_selection import ShuffleSplit
n_trees = 1000
n_instances = 100
rs = ShuffleSplit(n_splits=n_trees, test_size=1, train_size=n_instances).split(X_train)
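To make the idea of bagging concrete before using the built-in class, here is a minimal sketch of bagging by hand using the generator rs above: train one Decision Tree per miniset and take a majority vote on the test set. The choice max_leaf_nodes=16 is only an illustrative assumption.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Train one small tree on each miniset produced by the generator rs.
predictions = []
for train_index, _ in rs:
    tree = DecisionTreeClassifier(max_leaf_nodes=16)
    tree.fit(X_train[train_index], y_train[train_index])
    predictions.append(tree.predict(X_test))

# Majority vote: for each test point, take the most frequent predicted class.
predictions = np.array(predictions)
votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), axis=0, arr=predictions)
(votes == y_test).mean()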
sklearn provides BaggingClassifier to directly perform bagging or pasting. The code is as follows.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bag_clf = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=1000,
                            max_samples=1.0,
                            bootstrap=True)
In the above code, bag_clf is a bagging classifier made of 1000 DecisionTreeClassifiers, each trained on a sample whose size equals the original training set (max_samples=1.0). The option bootstrap=True means that it is bagging. If you would like to use pasting, the option is bootstrap=False.
This bag_clf also has .fit() and .predict() methods, and it is used in the same way as our previous classifiers. Let us try the make_moons dataset.
from sklearn.metrics import accuracy_score
bag_clf.fit(X_train, y_train)
y_pred_bag = bag_clf.predict(X_test)
accuracy_score(y_pred_bag, y_test)
0.8446666666666667
4.1.2 max_samples
The original bagging algorithm requires that the "subset" have the same number of data points as the original set. Since duplicates are allowed, each such sample is expected to contain about 63% of the distinct data points from the original dataset.
Click to expand for more details.
Let the original dataset have \(N\) elements. The probability to pick any specific data is \(1/N\), and therefore the probability not to pick it is \(1-1/N\). If we pick \(N\) elements with replacement, the probability that we never pick a specific element is \[ (1-\frac1N)^N\rightarrow \mathrm e^{-1},\quad \text{when }N\rightarrow\infty. \] Therefore when \(N\) is large, the probability for each element not to be picked is approximately \(\mathrm e^{-1}\), and then the probability for it to be picked is approximately \(1-\mathrm e^{-1}\).
Let \(I_i\) be the random variable representing whether the element \(i\) is picked or not. So \[ I_i=\begin{cases}1&i \text{ is picked}\\ 0&i\text{ is not picked}\end{cases}. \] Then \[ P(I_i=1)=1-\mathrm e^{-1},\quad P(I_i=0)=\mathrm e^{-1},\quad \text{ and }E[I_i]=1\cdot(1-\mathrm e^{-1})+0\cdot(\mathrm e^{-1})=1-\mathrm e^{-1}. \] The expectation of the final sample size proportion is \[ E\left[\frac1N\sum_{i=1}^NI_i\right]=\frac1N\sum_{i=1}^NE[I_i]=1-\mathrm e^{-1}\approx 0.63. \]
With a similar calculation, if we pick \(\alpha N\) data points, the expectation of the final sample size proportion is \(1-\mathrm e^{-\alpha}\).

In sklearn, this requirement is relaxed: you may have any number of data points in each sample. This gives more flexibility to adjust your model, since the sample size may affect its performance. It is set by the argument max_samples.
- If it is a float between 0 and 1, it will be used as the proportion. The default value is \(1.0\), which means that the sample size equals the original training set size.
- If it is an integer bigger than \(1\), it will be used as the fixed sample size, no matter what the original data size is.
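As a quick illustration (a sketch that reuses the make_moons training set from above; the specific values 0.5, 200 and n_estimators=100 are arbitrary choices for demonstration), the two ways of specifying max_samples look like this:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# max_samples as a proportion: each tree sees a sample of 50% of the training set size.
bag_prop = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                             max_samples=0.5, bootstrap=True)

# max_samples as a fixed count: each tree sees a sample of exactly 200 data points.
bag_count = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                              max_samples=200, bootstrap=True)

bag_prop.fit(X_train, y_train)
bag_count.fit(X_train, y_train)
bag_prop.score(X_test, y_test), bag_count.score(X_test, y_test)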
4.1.3 OOB score
When we use bagging, it is possible that some of the training data are not used. In this case, we could record which data are not used and just use them as the test set, instead of providing extra data for testing. The data that are not used are called out-of-bag instances, or oob for short. The accuracy over the oob data is called the oob score.
We could set oob_score=True to enable this feature when creating a BaggingClassifier, and use .oob_score_ to get the oob score after training.
bag_clf_oob = BaggingClassifier(DecisionTreeClassifier(),
                                n_estimators=1000,
                                bootstrap=True,
                                oob_score=True)

bag_clf_oob.fit(X_train, y_train)
bag_clf_oob.oob_score_
0.8398823529411765
4.1.4 Random Forests
When the classifiers used in a bagging classifier are all Decision Trees, the bagging classifier is called a random forest. sklearn provides the RandomForestClassifier class. It is almost the same as BaggingClassifier + DecisionTreeClassifier.
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=1000, random_state=1)

rnd_clf.fit(X_train, y_train)
rnd_clf.score(X_test, y_test)
0.844
When we use the Decision Tree as our base estimator, the class RandomForestClassifier provides more control over growing the random forest, along with certain optimizations. If you would like to use other estimators, then BaggingClassifier should be used.
4.1.5 Extra-trees
When growing a Decision Tree, our method is to search through all possible ways to find the best split point, the one that gives the lowest Gini impurity. Another method is to use a random split. Of course a single random tree performs much worse, but if we use many of them to form a random forest, the voting system can help to increase the accuracy. On the other hand, random splitting is much faster than growing a regular Decision Tree.
This type of forest is called Extremely Randomized Trees, or Extra-Trees for short. We could modify the above random forest classifier code to implement the Extra-Trees algorithm. The key point is that we don't search for the best split; instead we perform a random split: select a random feature, and split at a random value. In sklearn there is an ExtraTreesClassifier to create such a classifier.
Note that for Extra-Trees we use the whole training set instead of bootstrap samples, because the random splits already provide enough diversity among the trees. So technically Extra-Trees is not a bagging method in the standard sense, although it looks like one.
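To see what a single random split looks like, here is a minimal sketch (an illustration only, not sklearn's actual implementation): pick a random feature, then pick a random threshold within that feature's observed range.

import numpy as np

rng = np.random.default_rng(0)

def random_split(X):
    # Pick a random feature and a random threshold within its observed range.
    feature = rng.integers(X.shape[1])
    low, high = X[:, feature].min(), X[:, feature].max()
    threshold = rng.uniform(low, high)
    return feature, threshold

feature, threshold = random_split(X_train)
left = X_train[:, feature] <= threshold   # boolean mask for the left child
feature, threshold, left.sum(), (~left).sum()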
from sklearn.ensemble import ExtraTreesClassifier
ext_clf = ExtraTreesClassifier(n_estimators=1000, random_state=1)

ext_clf.fit(X_train, y_train)
ext_clf.score(X_test, y_test)
0.8466666666666667
In the above example, RandomForestClassifier and ExtraTreesClassifier get similar accuracy. However, from the code below you will see that in this example ExtraTreesClassifier is much faster than RandomForestClassifier.
from time import time
t0 = time()
rnd_clf = RandomForestClassifier(n_estimators=1000, random_state=1)
rnd_clf.fit(X_train, y_train)
t1 = time()
print("Random Forest: {}".format(t1 - t0))

t0 = time()
ext_clf = ExtraTreesClassifier(n_estimators=1000, random_state=1)
ext_clf.fit(X_train, y_train)
t1 = time()
print("Extremely Randomized Trees: {}".format(t1 - t0))
Random Forest: 11.967942476272583
Extremely Randomized Trees: 5.601980447769165
4.1.6 Gini importance
After training a Decision Tree, we could look at each node. Each split is made against the feature that decreases the Gini impurity the most. In other words, we could say that this feature is the most important one for that split.
Using the average decrease in Gini impurity as a metric, we could measure the importance of each feature. This is called Gini importance. If a feature is useful, it tends to split mixed-label nodes into pure single-class nodes.
In the case of a random forest, since there are many trees, we compute a weighted average of the Gini importance across all trees, where splits that involve more data points carry more weight.
Using RandomForestClassifier, we can directly get access to the Gini importance of each feature by .feature_importances_. Please see the following example.
rnd_clf.fit(X_train, y_train)
rnd_clf.feature_importances_
array([0.46353558, 0.53646442])
In this example, you may see that the two features are roughly equally important, with the second feature slightly more important since on average it decreases the Gini impurity a little bit more.
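As a sanity check (a sketch, assuming rnd_clf has been fitted as above), the forest-level importances are essentially the average of the per-tree importances:

import numpy as np

# Each fitted tree inside the forest exposes its own feature_importances_.
per_tree = np.array([tree.feature_importances_ for tree in rnd_clf.estimators_])

# Averaging over the trees should closely match rnd_clf.feature_importances_.
per_tree.mean(axis=0), rnd_clf.feature_importances_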
4.2 Voting machine
4.2.1 Voting classifier
Assume that we have several trained classifiers. The easiest way to make a better classifier out of what we already have is to build a voting system. That is, each classifier gives its own prediction, which is counted as a vote, and the class with the most votes becomes the prediction of the system.

In sklearn, you may use VotingClassifier. It works as follows.
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
clfs = [('knn', KNeighborsClassifier(n_neighbors=5)),
        ('dt', DecisionTreeClassifier(max_depth=2))]
voting_clf = VotingClassifier(estimators=clfs, voting='hard')
All classifiers are stored in the list clfs, whose elements are tuples. The syntax is very similar to Pipeline. What the classifier does is to train all listed classifiers and use the majority vote to predict the class of the given test data. If each classifier has one vote, the voting method is hard. There is also a soft voting method. In this case, every classifier not only predicts the class of the given data, but also estimates the probability that the given data belongs to each class. At the coding level, each classifier should have the predict_proba() method. In this case, the weight of each vote is determined by the computed probability. In our course we mainly use hard voting.
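For completeness, here is a minimal sketch of soft voting (the particular estimators and hyperparameters are only illustrative assumptions; both of them implement predict_proba()):

from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# 'soft' voting averages the predicted class probabilities
# instead of counting one hard vote per classifier.
soft_clf = VotingClassifier(
    estimators=[('knn', KNeighborsClassifier(n_neighbors=5)),
                ('dt', DecisionTreeClassifier(max_depth=2))],
    voting='soft')

soft_clf.fit(X_train, y_train)
soft_clf.score(X_test, y_test)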
Let us use make_moons as an example. We first load the dataset.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
We would like to apply the kNN model. As before, we build a data pipeline pipe to first apply MinMaxScaler and then KNeighborsClassifier.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
pipe = Pipeline(steps=[('scalar', MinMaxScaler()),
                       ('knn', KNeighborsClassifier())])
parameters = {'knn__n_neighbors': list(range(1, 51))}
gs_knn = GridSearchCV(pipe, param_grid=parameters)

gs_knn.fit(X_train, y_train)
clf_knn = gs_knn.best_estimator_
clf_knn.score(X_test, y_test)
0.8666666666666667
The resulting accuracy is shown above.
We then try it with the Decision Tree.
from sklearn.tree import DecisionTreeClassifier
gs_dt = GridSearchCV(DecisionTreeClassifier(),
                     param_grid={'max_depth': list(range(1, 11)),
                                 'max_leaf_nodes': list(range(10, 30))})

gs_dt.fit(X_train, y_train)
clf_dt = gs_dt.best_estimator_
clf_dt.score(X_test, y_test)
0.8613333333333333
We would also like to try the Logistic Regression method. It will be covered in the next chapter. At the current stage we just use the default settings without changing any hyperparameters.
from sklearn.linear_model import LogisticRegression
clf_lr = LogisticRegression()

clf_lr.fit(X_train, y_train)
clf_lr.score(X_test, y_test)
0.8393333333333334
Now we use a voting classifier to combine the results.
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
clfs = [('knn', KNeighborsClassifier()),
        ('dt', DecisionTreeClassifier()),
        ('lr', LogisticRegression())]
voting_clf = VotingClassifier(estimators=clfs, voting='hard')

voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)
0.85
You may compare the results of all these four classifiers. The voting classifier is not guaranteed to be better. It is just a way to form a model.
4.3 AdaBoost
This is the first algorithm that successfully implements the boosting idea. AdaBoost is short for Adaptive Boosting.
4.3.1 Weighted dataset
We first talk about training a Decision Tree on a weighted dataset. The idea is very simple. When building a Decision Tree, we use some method to determine the split. In this course the Gini impurity is used. There are at least two other methods: cross-entropy and the misclassification rate. For all three, the count of the elements in each class is the essential part. To train the model over a weighted dataset, we just need to replace the count of the elements by the weighted count.
Example 4.1 Consider the following data:
| x0 | x1 | y | Weight |
|-----|-----|---|--------|
| 1.0 | 2.1 | + | 0.5 |
| 1.0 | 1.1 | + | 0.125 |
| 1.3 | 1.0 | - | 0.125 |
| 1.0 | 1.0 | - | 0.125 |
| 2.0 | 1.0 | + | 0.125 |
The weighted Gini impurity is
\[ \text{WeightedGini}=1-(0.5+0.125+0.125)^2-(0.125+0.125)^2=0.375. \]
You may see that the original Gini impurity is just the weighted Gini impurity with equal weights. Therefore the first tree we get from AdaBoost (see below) is the same tree we get from the Decision Tree model in Chapter 3.
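As a quick check of Example 4.1, here is a small sketch that just transcribes the table above and computes the weighted Gini impurity:

import numpy as np

y_ex = np.array(['+', '+', '-', '-', '+'])
w_ex = np.array([0.5, 0.125, 0.125, 0.125, 0.125])

def weighted_gini(y, w):
    # Replace class counts by the total weight of each class.
    total = w.sum()
    return 1 - sum((w[y == c].sum() / total) ** 2 for c in np.unique(y))

weighted_gini(y_ex, w_ex)   # 0.375, matching the computation above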
4.3.2 General process
Here is a rough description of AdaBoost.
- Assign weights to each data point. At the beginning we could assign the weights equally.
- Train a classifier based on the weighted dataset, and use it to predict on the training set. Find out all wrong answers.
- Adjust the weights by increasing the weights of the data points that were misclassified in the previous round.
- Train a new classifier using the new weighted dataset. Predict on the training set and record the wrong answers.
- Repeat the above process to get many classifiers. The training stops either by hitting \(0\) error rate, or after a specific number of rounds.
- The final result is based on the weighted total votes from all classifiers we trained.
Now let us talk about the details. Assume there are \(N\) data points. Then the initial weights are all set to be \(\dfrac1N\). There are two sets of weights. Let \(w^{(i)}\) be the weight of the \(i\)th data point, and let \(\alpha_j\) be the weight of the \(j\)th classifier. After training the \(j\)th classifier, the error rate is denoted by \(e_j\). Then we have
\[ e_j=\frac{\text{the total weights of data points that are misclassified by the $j$th classifier}}{\text{the total weights of data points}} \]
\[ \alpha_j=\eta\ln\left(\dfrac{1-e_j}{e_j}\right). \]
\[ w^{(i)}\leftarrow\begin{cases}w^{(i)}&\text{if the $i$th data point is correctly classified,}\\w^{(i)}\exp(\alpha_j)&\text{if the $i$th data point is misclassified,}\end{cases} \] and then the new weights \(w^{(i)}_{\text{new}}\) are obtained by normalizing so that they sum to \(1\).
The first tree is the same tree we get from the regular Decision Tree model. In the rest of the training process, more weight is put on the data points that we got wrong in the previous iteration. Therefore the process mimics "learning from mistakes".
The \(\eta\) in computing \(\alpha_j\) is called the learning rate. It is a hyperparameter that is specified manually. It does exactly what it appears to do: alter the weights of each classifier. The default is 1.0. When it is very small (which is often recommended, although it can be any positive number), more iterations are expected.
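The following is a minimal sketch of one round of the weight update described above (assuming a generic fitted classifier clf and a learning rate eta; this only illustrates the formulas, it is not sklearn's exact implementation):

import numpy as np

def adaboost_round(clf, X, y, w, eta=1.0):
    # One AdaBoost round: compute the weighted error rate e_j, the classifier
    # weight alpha_j, and the updated, normalized data point weights.
    y_pred = clf.predict(X)
    wrong = (y_pred != y)

    e = w[wrong].sum() / w.sum()        # weighted error rate (assumed 0 < e < 1)
    alpha = eta * np.log((1 - e) / e)   # classifier weight

    w_new = w.copy()
    w_new[wrong] *= np.exp(alpha)       # boost the misclassified points
    w_new /= w_new.sum()                # normalization
    return alpha, w_new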
4.3.3 Example 1: the iris dataset
Similar to all previous models, sklearn provides AdaBoostClassifier, and it is used in the same way. Note that although we are able to use any classifier for AdaBoost, the most popular choice is a Decision Tree with max_depth=1. This type of Decision Tree is also called a Decision Stump.

In the following example, we initialize an AdaBoostClassifier with 1000 Decision Stumps and learning_rate=0.5.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=1000,
                             learning_rate=.5)
We will use the iris dataset for illustration. The cross-validation score is calculated as follows.
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
scores = cross_val_score(ada_clf, X, y, cv=5)
scores.mean()
np.float64(0.9466666666666667)
4.3.4 Example 2: the Horse Colic dataset
This dataset is from the UCI Machine Learning Repository. The data is about whether horses survive when they get a disease called colic. The dataset is preprocessed, and the result can be downloaded here. Note that, based on the description, we only use the first 23 columns, and we change the original 3 outcomes (lived, died, was euthanized) into 2 outcomes (lived, died) by combining outcomes 2 and 3.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
= "assests/datasets/horse_colic_clean.csv"
filepath = pd.read_csv(filepath)
df = df.iloc[:, :22].to_numpy().astype(float)
X = (df.iloc[:, 22]<2).to_numpy().astype(int)
y = train_test_split(X, y, test_size=0.15) X_train, X_test, y_train, y_test
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
RANDOMSTATE = 1
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1, random_state=RANDOMSTATE),
    random_state=RANDOMSTATE
)

# Hyperparameter grid
param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 1.0],
    "estimator__max_depth": [1, 2, 3]
}

# Grid search with cross-validation
grid = GridSearchCV(
    estimator=ada,
    param_grid=param_grid,
    cv=5,
    scoring="accuracy"
)

grid.fit(X_train, y_train)
grid.best_estimator_.score(X_test, y_test)
0.6964285714285714
grid.best_params_
{'estimator__max_depth': 2, 'learning_rate': 0.01, 'n_estimators': 200}
4.4 Exercises
Exercise 4.1 CHOOSE ONE: Please apply the random forest to one of the following datasets.
- the iris dataset.
- the dating dataset.
- the titanic dataset.
Please answer the following questions.
- Please use grid search to find good max_leaf_nodes and max_depth.
- Please record the cross-validation score and the OOB score of your model and compare it with the models you learned before (kNN, Decision Trees).
- Please find some typical features (using the Gini importance) and draw the decision boundary against the features you choose.
Exercise 4.2 Please use the following code to get the mgq dataset.
from sklearn.datasets import make_gaussian_quantiles
import numpy as np

X1, y1 = make_gaussian_quantiles(cov=2.0, n_samples=200, n_features=2,
                                 n_classes=2, random_state=1)
X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5, n_samples=300,
                                 n_features=2, n_classes=2, random_state=1)
X = np.concatenate((X1, X2))
y = np.concatenate((y1, -y2 + 1))
Please build an AdaBoost model.
Exercise 4.3 Please use RandomForestClassifier, ExtraTreesClassifier and KNeighborsClassifier to form a voting classifier, and apply it to the MNIST dataset.

MNIST
: This dataset can be loaded using the following code.
import numpy as np
import requests
from io import BytesIO
r = requests.get('https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz', stream=True)
data = np.load(BytesIO(r.raw.read()))
X_train = data['x_train']
X_test = data['x_test']
y_train = data['y_train']
y_test = data['y_test']
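Note that the images load as arrays of shape (n, 28, 28), while the sklearn classifiers above expect 2D input, so you will likely want to flatten them first (a sketch):

# Flatten each 28x28 image into a 784-dimensional feature vector.
X_train = X_train.reshape(len(X_train), -1)
X_test = X_test.reshape(len(X_test), -1)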