
Decision trees and random forests

An article describing decision trees and random forests, and how to implement them in Python with Scikit-Learn.
A set of four flow diagrams with nodes that are either question marks with two outputs, or labelled A, B, or C with no outputs. In each flow diagram a different path is taken to the final nodes A, B, and C.

In the previous video we talked in some detail about decision trees.

This article gives a brief recap, and demonstrates how to make decision trees and random forests using Python.

What are decision trees?

Decision trees are a form of supervised learning, meaning that for classification problems each example in your data is labelled with one of a set of known target classes. It is also possible to use decision trees for regression problems, but we won’t discuss that here.

You can think of a decision tree as a set of rules or questions that you apply to your dataset to split it into smaller and smaller subsets, until each subset contains members of just one class, or at least until the majority of its members share the same class.

The root of a decision tree is the first decision node, which has a yes or no answer leading to two further nodes. The two paths from each node are known as branches or edges. The last node along each set of branches is known as a leaf and represents a decision to assign a data point to one class in particular.
All decisions in a decision tree need to have clear yes or no answers. For numerical data the decisions are generally something like “is feature X less than or equal to value Y”. You can also use categorical features, for example “does feature X belong to category Y”. We’ll talk a bit more about how each decision is chosen and evaluated in the following sections.
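To make this concrete, the yes/no decisions can be read as nested conditionals. The sketch below is a hypothetical two-level tree; the feature names and thresholds are invented for illustration, not taken from a trained model:

```python
# A hypothetical two-level decision tree written as nested yes/no questions.
# Feature names and thresholds are invented for illustration.
def classify(petal_length, petal_width):
    if petal_length <= 2.5:          # root decision node
        return "A"                   # leaf
    else:
        if petal_width <= 1.8:       # second decision node
            return "B"               # leaf
        else:
            return "C"               # leaf

print(classify(1.0, 0.2))  # A
print(classify(5.0, 2.3))  # C
```

Each `if` is a decision node, each `return` is a leaf, and the two branches of every `if` are the edges described above.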

How are they made?

Training a decision tree begins with the full training dataset. The algorithm looks at all possible pairs of features (X) and thresholds (Y), and evaluates which pair ((X,Y)) minimises a cost function (F) that measures how ‘pure’ the new subsets are. In other words, the cleaner the split into distinct classes, the lower the value (F) takes.
Once the first split has been made, the algorithm then looks for the best possible split of each of the two new nodes, and so on until one or more stopping criteria are reached. This could be that all members of a subset have the same class, or that no further split improves the purity score. It could also be a criterion chosen by the user, such as a maximum depth (the number of decisions along a given path) or a minimum subset size.
This is an example of a greedy algorithm. It will always look for the best split at the top level, and so on down the tree. The drawback to this approach is that the algorithm is not guaranteed to find the globally optimal solution. For example, a slightly less good initial decision may lead to much better splits later on.
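In Scikit-Learn, the user-chosen stopping criteria described above appear as parameters when constructing a tree. A minimal sketch using DecisionTreeClassifier on the Iris dataset (the particular values of max_depth and min_samples_leaf are arbitrary examples):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# max_depth and min_samples_leaf are user-chosen stopping criteria:
# growth stops once a path contains 3 decisions, and no leaf may
# hold fewer than 5 training examples.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
tree.fit(iris.data, iris.target)

print(tree.get_depth())  # never more than 3
```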

How nodes are evaluated – Gini impurity

A commonly used tree-building method is the CART (Classification And Regression Trees) algorithm. A key notion used by CART to evaluate how well a decision splits the data is a measure known as the Gini impurity ((G)). This can be defined as:

\[ G = 1 - \sum_{i=1}^{n} p_i^2 \]

where (n) is the number of classes, and (p_i) is the proportion of instances of class (i) in the node, i.e. the probability that a data point in that node belongs to class (i).

So if for example we have two classes and 16 instances in a particular node, and 4 are of class A and 12 of class B:

\[ G = 1 - (4/16)^2 - (12/16)^2 = 0.375 \]

The lower the value for (G), the better the split. If all the instances at a particular node are the same class then (G=0).
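The calculation is easy to reproduce. Here is a small helper (a sketch for illustration, not part of Scikit-Learn) that computes (G) from the per-class counts in a node:

```python
def gini(counts):
    """Gini impurity from a list of per-class counts in a node."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([4, 12]))  # 0.375, matching the worked example above
print(gini([16, 0]))  # 0.0 -- a pure node
```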

When making every new pair of nodes, the CART algorithm looks at all possible pairs of features (X) and thresholds (Y), splits the dataset into (yes) and (no) subsets, and tries to minimise the following cost function (F):

\[ F(X,Y) = \frac{m_{yes}}{m} G_{yes} + \frac{m_{no}}{m} G_{no} \]

where (m) is all the examples in the parent node, (m_{yes}) and (m_{no}) are the number of examples assigned to each respective group, and (G_{yes}) and (G_{no}) are the Gini impurities.
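The search for the best split can be sketched in a few lines of plain Python. This is an illustrative brute-force version of one greedy CART step on an invented toy dataset, not Scikit-Learn's optimised implementation:

```python
# One greedy CART step: for every feature X and candidate threshold Y,
# compute the weighted Gini cost F and keep the minimum.
def gini(labels):
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_split(X, y):
    m = len(y)
    best = (None, None, float("inf"))  # (feature, threshold, F)
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            yes = [y[i] for i in range(m) if X[i][feature] <= threshold]
            no = [y[i] for i in range(m) if X[i][feature] > threshold]
            if not yes or not no:
                continue  # split puts everything on one side; skip it
            F = len(yes) / m * gini(yes) + len(no) / m * gini(no)
            if F < best[2]:
                best = (feature, threshold, F)
    return best

# Tiny invented dataset: one feature separates the classes perfectly.
X = [[1.0], [1.5], [3.0], [3.5]]
y = ["A", "A", "B", "B"]
print(best_split(X, y))  # (0, 1.5, 0.0) -- a perfectly pure split
```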

Don’t worry too much if you’re struggling with the details here; the important thing is to understand the basic concepts of nodes, decisions and branches, and how they ultimately lead to a final classification.

Ensemble methods – Random forests

Often, as we will see in our Python example, many decision trees are constructed for a particular dataset rather than just one. For obvious reasons such a collection is known as a random forest, and it is an example of an ensemble method.

In ensemble methods, rather than a single model, several (or many) different models are made using the data, and the predictions are based on the consensus, or most common prediction of all the different models. The idea here is that the predictions are less sensitive to possible overfitting by any individual model.

In the case of random forests, the trees are all constructed in the same basic way, but vary in the subsets of the training data and of the features used to split up the data.

To make a prediction using a random forest classifier, you just assign the class that is most commonly predicted among all the individual trees. An added feature here of course is that you can also assign a probability to this prediction based on how split the decision was among the individual decision trees.
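Both the vote and the associated probability are simple to compute. A sketch with a hypothetical five-tree forest whose individual predictions are given as a list:

```python
from collections import Counter

# Hypothetical predictions from the five individual trees in a forest.
tree_predictions = ["A", "A", "B", "A", "B"]

votes = Counter(tree_predictions)
predicted_class, n_votes = votes.most_common(1)[0]
probability = n_votes / len(tree_predictions)

print(predicted_class, probability)  # A 0.6
```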

Example in Python

For our example in Python we will use the RandomForestClassifier() in sklearn.ensemble on the Iris dataset. As the name suggests, this is a random forest ensemble method in which the predicted class is the consensus prediction of a number of individual decision trees.

First though, let’s import and split the data in the usual way:

from sklearn.datasets import load_iris

iris = load_iris()

X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split
Xtrain, Xtest, y_train, y_test = train_test_split(X, y)

As usual we need to set up the model by importing and creating an instance of the model class, and then using the fit() method to train the model:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=1000)
model.fit(Xtrain, y_train)

The n_estimators parameter sets the number of decision trees to use in the forest. We have set it to 1000, but the default value is 100.

We can then evaluate the accuracy in the same way as with the models we’ve tried before:

from sklearn.metrics import accuracy_score
y_predicted = model.predict(Xtest)
score = accuracy_score(y_test,y_predicted)
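Putting the whole example together, with a fixed random_state added to the split so the result is reproducible (an addition to the code above), and with predict_proba showing the vote-based class probabilities mentioned earlier:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = load_iris()
Xtrain, Xtest, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

model = RandomForestClassifier(n_estimators=1000)
model.fit(Xtrain, y_train)

score = accuracy_score(y_test, model.predict(Xtest))
print(score)

# Per-class probabilities for the first test point: the fraction of
# trees in the forest that voted for each of the three classes.
print(model.predict_proba(Xtest[:1]))
```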
This article is from the free online course Machine Learning for Image Data, created by FutureLearn.
