# Decision trees and random forests

An article describing decision trees and random forests, and how to implement them in Python Scikit-Learn.

In the previous video we talked in some detail about decision trees.

This article gives a brief recap, and demonstrates how to make decision trees and random forests using Python.

## What are decision trees?

Decision trees are a form of supervised learning, meaning that for classification problems your data belongs to a set of known target classes. It is also possible to use decision trees for regression problems, but we won’t discuss that here.

You can think of decision trees as a set of rules or questions that you apply to your dataset to split it into smaller and smaller subsets until each subset contains members of just one class, or at least the majority of members are the same class.

The root of a decision tree is the first decision node, which has a yes or no answer leading to two further nodes. The two paths from each node are known as branches or edges. The last node along each set of branches is known as a leaf and represents a decision to assign a data point to one class in particular.
All decisions in a decision tree need to have clear yes or no answers. For numerical data the decisions are generally something like “is feature X less than or equal to value Y”. You can also use categorical features, for example “does feature X belong to category Y”. We’ll talk a bit more about how each decision is chosen and evaluated in the following sections.

## How are they made?

Training a decision tree begins with the full training dataset and looks at all the possible pairs of features (X) and thresholds (Y), and evaluates which pair ((X,Y)) minimises a cost function (F), which measures how ‘pure’ the new subsets are. In other words, the better the split into unique categories, the lower the value (F) will take.
Once the first split has been made, the algorithm then looks for the best possible split of the two new nodes, and so on until one or more stopping criteria are reached. This could be that all members of a subset have the same class, or that any further split does not improve the purity score. It could also be some stopping criteria chosen by the user such as maximum depth (the number of decisions along a certain path), or minimum set size.
This is an example of a greedy algorithm. It will always look for the best split at the top level, and so on down the tree. The drawback to this approach is that the algorithm is not guaranteed to find the globally optimal solution. For example, a slightly less good initial decision may lead to much better splits later on.

## How nodes are evaluated – Gini impurity

A commonly used cost function is known as the CART (Classification and regression tree) algorithm. A key notion used by CART to evaluate how well nodes are splitting the data is a measure known as the Gini impurity ((G)). This can be defined as:

[G = 1 – sum_{i=1}^n p_i^2]

where (n) in the number of classes, and (p_i) is the proportion of instances of class (i) in the node, or the probability a data point will be of that class in that node.

So if for example we have two classes and 16 instances in a particular node, and 4 are of class A and 12 of class B:

[G = 1 – (4/16)^2 – (12/16)^2 = 0.375]

The lower the value for (G), the better the split. If all the instances at a particular node are the same class then (G=0).

When making every new pair of nodes, the CART algorithm looks at all possible pairs of features (X) and thresholds (Y), splits the dataset into (yes) and (no) subsets, and tries to minimise the following cost function (F):

[F(X,Y) = frac{m_{yes}}{m}G_{yes} + frac{m_{no}}{m}G_{no}]

where (m) is all the examples in the parent node, (m_{yes}) and (m_{no}) are the number of examples assigned to each respective group, and (G_{yes}) and (G_{no}) are the Gini impurities.

Don’t worry too much if you’re struggling with the details here, the important thing is to understand the basic concepts of nodes, decisions and branches, and how they ultimately lead to a final classification.

## Ensemble methods – Random forests

Often, as we will see with our example in Python, many more than one decision tree is constructed for a particular dataset. For obvious reasons these are known as random forests and are an example of an ensemble method.

In ensemble methods, rather than a single model, several (or many) different models are made using the data, and the predictions are based on the consensus, or most common prediction of all the different models. The idea here is that the predictions are less sensitive to possible overfitting by any individual model.

In the case of random forests, the trees are all constructed in the same basic way, but vary in terms of different subsets of training data or features used to split up the data.

To make a prediction using a random forest classifier, you just assign the class that is most commonly predicted among all the individual trees. An added feature here of course is that you can also assign a probability to this prediction based on how split the decision was among the individual decision trees.

## Example in Python

For our example in Python we will use the RandomForestClassifier() in sklearn.ensemble on the Iris dataset. As the name suggests this is a random forest ensemble method in which the predicted class in the consensus prediction of a number of different individual decision trees.

First though, let’s import and split the data in the usual way:

from sklearn.datasets import load_irisiris = load_iris()X = iris.datay = iris.targetfrom sklearn.model_selection import train_test_splitXtrain, Xtest, y_train, y_test = train_test_split(X, y)

As usual we need to set up the model by importing and creating an instance of the model class, and then using the fit() method to train the model:

from sklearn.ensemble import RandomForestClassifiermodel=RandomForestClassifier(n_estimators=1000,random_state=13)model.fit(Xtrain,y_train)

The n_estimators parameters sets the number of decision trees to use in the forest. We have set it to be 1000, but the default value is 100.

We can then evaluate the accuracy in the same way as with the models we’ve tried before:

from sklearn.metrics import accuracy_scorey_predicted = model.predict(Xtest)score = accuracy_score(y_test,y_predicted)print(score)
0.9736842105263158