After we get some relatively simple classifiers (sometimes also called weak classifiers), we may put them together to form a more complicated classifier. This type of method is called an ensemble method. The basic way to "ensemble" classifiers together is through a voting machine.
There are mainly two ways to generate many classifiers.
bagging: This is also called bootstrap aggregating. The idea is
First we randomly pick samples from the original dataset to form a bunch of new training datasets;
Then we apply the same learning method to those training datasets to get a bunch of classifiers;
Finally we apply all these classifiers to the data we are interested in and use the most frequent class as the result.
boosting: We train a bunch of classifiers in sequence. We assign weights to each of the classifiers and change the weights adaptively according to the results of the current combination.
4.1 Bootstrap aggregating
4.1.1 Basic bagging
One approach to get many estimators is to use the same training algorithm for every predictor and train them on different random subsets of the training set. When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting.
Consider the following example. The dataset is the one we used in Chapter 3: make_moons. We split the dataset into training and test sets.
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
plt.scatter(x=X[:, 0], y=X[:, 1], c=y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
We would like to sample from the dataset to get some smaller minisets. We will use sklearn.model_selection.ShuffleSplit to perform the action.
The output of ShuffleSplit is a generator. To get the indices out of it we need a for loop. You may check out the following code.
Note that ShuffleSplit is originally used to split data into training and test sets. We only use the shuffling part of it, so we set test_size to 1 and use _ later in the for loop, since we won't use that part of the information.
What we finally get is a generator rs that produces indices of subsets of X_train and y_train.
from sklearn.model_selection import ShuffleSplit

n_trees = 1000
n_instances = 100
rs = ShuffleSplit(n_splits=n_trees, test_size=1, train_size=n_instances).split(X_train)
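As a quick illustration, the indices can be pulled out of the generator with a for loop like the one sketched below. The loop body here (training one small Decision Tree per miniset, and the max_leaf_nodes value) is an assumption for demonstration, not code from the original.

```python
from sklearn.tree import DecisionTreeClassifier

mini_trees = []
for train_index, _ in rs:   # the second item is the unused "test" index
    X_mini, y_mini = X_train[train_index], y_train[train_index]
    tree = DecisionTreeClassifier(max_leaf_nodes=16)  # hypothetical hyperparameter
    tree.fit(X_mini, y_mini)
    mini_trees.append(tree)
```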
sklearn provides BaggingClassifier to directly perform bagging or pasting. The code is as follows.
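A sketch consistent with the description below; the settings not mentioned in the text are left at their defaults.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 500 Decision Trees, each trained on a bootstrap sample whose size
# equals the training set size (max_samples=1.0); bootstrap=True means bagging.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=1.0,
    bootstrap=True,
)
```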
In the above code, bag_clf is a bagging classifier made of 500 DecisionTreeClassifiers, each trained on a bootstrap sample whose size is 1.0 times the training set size. The option bootstrap=True means that it is bagging. If you would like to use pasting instead, set bootstrap=False.
This bag_clf also has .fit() and .predict() methods, and it is used in the same way as our previous classifiers. Let us try the make_moons dataset.
from sklearn.metrics import accuracy_score

bag_clf.fit(X_train, y_train)
y_pred_bag = bag_clf.predict(X_test)
accuracy_score(y_pred_bag, y_test)
0.8406666666666667
4.1.2 max_samples
The original bagging algorithm requires each "subset" to have the same number of data points as the original set. Since duplicates are allowed, each bootstrap sample is expected to contain about 63% of the distinct data points from the original dataset.
Let the original dataset have \(N\) elements. The probability of picking any specific data point is \(1/N\), and therefore the probability of not picking it is \(1-1/N\). If we pick \(N\) elements with replacement, the probability that we never pick a specific element is \[
(1-\frac1N)^N\rightarrow \mathrm e^{-1},\quad \text{when }N\rightarrow\infty.
\] Therefore when \(N\) is large, the probability for each element not to be picked is approximately \(\mathrm e^{-1}\), and then the probability for it to be picked is approximately \(1-\mathrm e^{-1}\).
Let \(I_i\) be the random variable representing whether the element \(i\) is picked or not. So \[
I_i=\begin{cases}1&i \text{ is picked}\\
0&i\text{ is not picked}\end{cases}.
\] So \[
P(I_i=1)=1-\mathrm e^{-1},\quad P(I_i=0)=\mathrm e^{-1},\quad \text{ and }E[I_i]=1\cdot(1-\mathrm e^{-1})+0\cdot(\mathrm e^{-1})=1-\mathrm e^{-1}.
\] The expectation of the final sample size proportion is \[
E[\frac1N\sum_{i=1}^NI_i]=\frac1N\sum_{i=1}^NE[I_i]=1-\mathrm e^{-1}\approx 0.63.
\]
By a similar calculation, if we pick \(\alpha N\) data points, the expectation of the final sample size proportion is \(1-\mathrm e^{-\alpha}\).
In sklearn, this requirement is relaxed: you may have any number of data points in each sample. This gives more flexibility to adjust your model, since the sample size may affect its performance. The sample size is set by the argument max_samples.
If it is a float between 0 and 1, it is used as the proportion of the training set size. The default value is \(1.0\), which means that each sample has the same size as the original training set.
If it is an integer bigger than \(1\), it is used as a fixed sample size, no matter what the original data size is. Both options are illustrated in the sketch after this list.
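A quick hedged illustration; the particular values 0.5 and 100 are arbitrary choices, not values from the text.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# proportion: each bootstrap sample uses 50% of the training set size
bag_half = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=0.5)

# fixed size: each bootstrap sample contains exactly 100 data points
bag_100 = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100)
```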
4.1.3 OOB score
When we use bagging, it is possible that some of the training data are never sampled. In this case, we could record which data points are not used and use them as a test set, instead of setting aside extra data for testing. The data points that are not used are called out-of-bag instances, or oob for short. The accuracy over the oob data is called the oob score.
We could set oob_score=True to enable the function when creating a BaggingClassifier, and use .oob_score_ to get the oob score after training.
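A minimal sketch, reusing the bagging setup from above with the oob option turned on; the hyperparameters are assumptions.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_oob = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    bootstrap=True,
    oob_score=True,   # record out-of-bag accuracy during training
)
bag_oob.fit(X_train, y_train)
bag_oob.oob_score_    # accuracy estimated from the out-of-bag instances
```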
4.1.4 Random forests
When the classifiers used in a bagging classifier are all Decision Trees, the bagging classifier is called a random forest. sklearn provides the RandomForestClassifier class. It is almost the same as BaggingClassifier + DecisionTreeClassifier.
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=1000, random_state=1)
rnd_clf.fit(X_train, y_train)
rnd_clf.score(X_test, y_test)
0.8446666666666667
When we use Decision Trees as our base estimators, the class RandomForestClassifier provides more control over growing the random forest, with certain optimizations. If you would like to use other estimators, BaggingClassifier should be used.
4.1.5 Extra-trees
When growing a Decision Tree, our method is to search through all possible splits to find the best split point, the one that gives the lowest Gini impurity. Another method is to use a random split. Of course a single random tree performs much worse, but if we use many of them to form a random forest, the voting system can help increase the accuracy. On the other hand, a random split is much faster to compute than the exhaustive search in a regular Decision Tree.
This type of forest is called Extremely Randomized Trees, or Extra-Trees for short. We could modify the above random forest classifier code to implement the extra-tree algorithm. The key point is that we don’t apply the Decision Tree algorithm. Instead we perform a random split: select a random feature, and split at a random value. In sklearn there is an ExtraTreesClassifier to create such a classifier.
Note that for Extra-Trees we use the whole dataset instead of bootstrap samples, because the random splits already provide enough diversity among the trees, so there is no need to introduce extra randomness by resampling. So technically Extra-Trees is not a bagging method in the standard sense, although it looks like one.
from sklearn.ensemble import ExtraTreesClassifier

ext_clf = ExtraTreesClassifier(n_estimators=1000, random_state=1)
ext_clf.fit(X_train, y_train)
ext_clf.score(X_test, y_test)
0.8386666666666667
In the above example, RandomForestClassifier and ExtraTreesClassifier get similar accuracy. However from the code below, you will see that in this example ExtraTreesClassifier is much faster than RandomForestClassifier.
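The timing code is not included in this extract; the sketch below is a reconstruction of what likely produced the output that follows (the use of time.time() and the exact print strings are assumptions).

```python
import time
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# time how long it takes to train each forest on the same data
start = time.time()
RandomForestClassifier(n_estimators=1000, random_state=1).fit(X_train, y_train)
print("Random Forest:", time.time() - start)

start = time.time()
ExtraTreesClassifier(n_estimators=1000, random_state=1).fit(X_train, y_train)
print("Extremely Randomized Trees:", time.time() - start)
```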
Random Forest: 11.222781658172607
Extremely Randomized Trees: 4.594525337219238
4.1.6 Gini importance
After training a Decision Tree, we could look at each node. Each split is made on the feature that decreases the Gini impurity the most. In other words, we could say that this feature is the most important one for that split.
Using the average decrease in Gini impurity as a metric, we could measure the importance of each feature. This is called Gini importance. If a feature is useful, it tends to split nodes with mixed labels into purer, single-class nodes.
In the case of a random forest, since there are many trees, we compute a weighted average of the Gini importance across all trees. The weight of each split depends on how many training samples reach the corresponding node.
Using RandomForestClassifier, we can directly access the Gini importance of each feature via .feature_importances_. Please see the following example.
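A minimal sketch using the rnd_clf trained above; the feature names in the loop are just labels for display.

```python
# Gini importance of the two features of the make_moons dataset
for name, score in zip(["feature 0", "feature 1"], rnd_clf.feature_importances_):
    print(name, score)
```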
In this example, you may see that the two features are relatively equally important, with the second feature slightly more important since on average it decreases the Gini impurity a little bit more.
4.2 Voting machine
4.2.1 Voting classifier
Assume that we have several trained classifiers. The easiest way to make a better classifier out of what we already have is to build a voting system. That is, each classifier gives its own prediction, which is counted as a vote, and the class with the most votes becomes the prediction of the system.
In sklearn, you may use VotingClassifier. It works as follows.
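A minimal sketch; the three classifiers chosen here are placeholders for illustration, while the list name clfs is the one referenced below.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# clfs is a list of (name, classifier) tuples, similar to a Pipeline
clfs = [
    ("tree", DecisionTreeClassifier()),
    ("knn", KNeighborsClassifier()),
    ("lr", LogisticRegression()),
]

voting_clf = VotingClassifier(estimators=clfs, voting="hard")
voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)
```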
All classifiers are stored in the list clfs, whose elements are tuples. The syntax is very similar to Pipeline. What the voting classifier does is to train all listed classifiers and use the majority vote to predict the class of the given test data. If each classifier has one vote, the voting method is hard. There is also a soft voting method. In this case, every classifier can not only predict the classes of the given data, but also estimate the probability that the given data belongs to each class. At the coding level, each classifier should have the predict_proba() method. In this case, the weight of each vote is determined by the computed probability. In our course we mainly use hard voting.
Let us use make_moon as an example. We first load the dataset.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
We would like to apply the kNN model. As before, we build a data pipeline pipe to first apply MinMaxScaler and then KNeighborsClassifier.
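A sketch of such a pipeline; the kNN hyperparameters are left at their defaults as an assumption.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

# scale the features first, then apply kNN
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("knn", KNeighborsClassifier()),
])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
```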
We would also like to try the Logistic regression method. This will be covered in the next Chapter. At the current stage we just use the default settings without changing any hyperparameters.
from sklearn.linear_model import LogisticRegression

clf_lr = LogisticRegression()
clf_lr.fit(X_train, y_train)
clf_lr.score(X_test, y_test)
0.8346666666666667
Now we use a voting classifier to combine the results.
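A sketch of the voting classifier built from the two models above; the estimator names inside the tuples are arbitrary labels.

```python
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(
    estimators=[("knn_pipe", pipe), ("lr", clf_lr)],
    voting="hard",
)
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)
```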
You may compare the results of all these classifiers. The voting classifier is not guaranteed to be better. It is just another way to form a model.
4.3 AdaBoost
This is the first algorithm that successfully implements the boosting idea. AdaBoost is short for Adaptive Boosting.
4.3.1 Weighted dataset
We first talk about training a Decision Tree on a weighted dataset. The idea is very simple. When building a Decision Tree, we use some metric to determine the split; in this course the Gini impurity is used. There are at least two other common choices: cross-entropy and the misclassification rate. For all three, counting the elements in each class is the essential part. To train the model over a weighted dataset, we just need to replace the count of the elements by the weighted count.
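As a small illustration of this idea, here is a minimal sketch of a weighted Gini impurity computation; the function and its implementation are mine, not code from the text.

```python
import numpy as np

def weighted_gini(labels, weights):
    """Gini impurity where each data point contributes its weight
    instead of a plain count of 1."""
    labels = np.asarray(labels)
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()
    gini = 1.0
    for c in np.unique(labels):
        p = weights[labels == c].sum() / total   # weighted class proportion
        gini -= p ** 2
    return gini

labels = [0, 0, 1, 1]
print(weighted_gini(labels, [1, 1, 1, 1]))   # equal weights: ordinary Gini, 0.5
print(weighted_gini(labels, [3, 3, 1, 1]))   # weighted version
```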
You may see that the original Gini impurity is just the weighted Gini impurity with equal weights. Therefore the first tree we get from AdaBoost (see below) is the same tree we get from the Decision Tree model in Chapter 3.
4.3.2 General process
Here is the rough description of AdaBoost.
Assign weights to each data point. At the beginning we could assign equal weights.
Train a classifier based on the weighted dataset, and use it to predict on the training set. Find out all wrong answers.
Adjust the weights by increasing the weights of the data points that were misclassified in the previous generation.
Train a new classifier using the new weighted dataset. Predict on the training set and record the wrong answers.
Repeat the above process to get many classifiers. The training stops either by hitting \(0\) error rate, or after a specific number of rounds.
The final result is based on the weighted total votes from all the classifiers we trained.
Now let us talk about the details. Assume there are \(N\) data points. Then the initial weights are all set to \(\dfrac1N\). There are two sets of weights: let \(w^{(i)}\) be the weight of the \(i\)th data point, and let \(\alpha_j\) be the weight of the \(j\)th classifier. After training the \(j\)th classifier, its error rate is denoted by \(e_j\). Then we have
\[
e_j=\frac{\text{the total weights of data points that are misclassified by the $j$th classifier}}{\text{the total weights of data points}}
\]
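The classifier weight \(\alpha_j\) is computed from this error rate; in the standard (SAMME) formulation of AdaBoost for binary classification it is \[
\alpha_j=\eta\ln\left(\frac{1-e_j}{e_j}\right),
\] where \(\eta\) is the learning rate discussed in the Note below. The data point weights are then updated and normalized: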
\[
w^{(i)}_{\text{new}}\leftarrow\text{normalization} \leftarrow w^{(i)}\leftarrow\begin{cases}w^{(i)}&\text{if the $i$th data is correctly classified,}\\w^{(i)}\exp(\alpha_j)&\text{if the $i$th data is misclassified.}\end{cases}
\]
Note
The first tree is the same tree we get from the regular Decision Tree model. In the rest of the training process, more weight is put on the data points that were classified wrongly in the previous iteration. Therefore the process mimics "learning from mistakes".
Note
The \(\eta\) in computing \(\alpha_j\) is called the learning rate. It is a hyperparameter that is specified manually. It does exactly what it appears to do: scale the weight of each classifier. The default is 1.0. When the number is very small (which is recommended, although it can be any positive number), more iterations will be needed.
4.3.3 Example 1: the iris dataset
Similar to all previous models, sklearn provides AdaBoostClassifier, and it is used in much the same way. Note that although we are able to use any classifier for AdaBoost, the most popular choice is a Decision Tree with max_depth=1. This type of Decision Tree is also called a Decision Stump.
In the following examples, we initialize an AdaBoostClassifier with 500 Decision Stumps and learning_rate=0.5.
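A sketch consistent with that description; the base estimator is passed positionally since its keyword name differs between sklearn versions, and all other settings are left at their defaults.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# 500 Decision Stumps (max_depth=1) combined with learning_rate=0.5
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=500,
    learning_rate=0.5,
)
```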
We will use the iris dataset for illustration. The cross_val_score is calculated as follows.
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
scores = cross_val_score(ada_clf, X, y, cv=5)
scores.mean()
np.float64(0.9466666666666667)
4.3.4 Example 2: the Horse Colic dataset
This dataset is from the UCI Machine Learning Repository. The data is about whether horses survive after getting a disease called colic. The dataset is preprocessed, and the result can be downloaded here. Note that based on the description, we only use the first 23 columns, and we change the original 3 outcomes (lived, died, was euthanized) into 2 outcomes (lived, died) by combining outcomes 2 and 3.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

filepath = "assests/datasets/horse_colic_clean.csv"
df = pd.read_csv(filepath)
X = df.iloc[:, :22].to_numpy().astype(float)
y = (df.iloc[:, 22] < 2).to_numpy().astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
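As a hedged illustration of the natural next step (this code is an assumption, reusing ada_clf from the previous example rather than code from the text):

```python
# fit the AdaBoost classifier from Example 1 on the Horse Colic training set
ada_clf.fit(X_train, y_train)
ada_clf.score(X_test, y_test)
```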
Exercise 4.3 Please use RandomForestClassifier, ExtraTreesClassifier and KNeighborsClassifier to form a voting classifier, and apply it to the MNIST dataset.
Note: MNIST
This dataset can be loaded using the following code.
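A common way to load MNIST in sklearn is via fetch_openml; assuming that is what is intended here, the sketch below downloads the 70,000-image dataset from OpenML.

```python
from sklearn.datasets import fetch_openml

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
```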