# Decision trees


So far, you’ve investigated several algorithms that machine learning uses to make decisions. In this step, you’ll explore decision trees, another common algorithm used to solve classification- and regression-based problems.

The use of decision trees is not new; humans have been using them to help make decisions for a long time. Take a look at the decision tree below, which I made to help me identify planets in our solar system.

Whilst this was not created by a machine, there are many similarities between it and a machine learning decision tree. Both the tree above and machine learning algorithms use binary decisions (ones with a true or false outcome) to split the data until they reach a final result. A key difference is that machine learning algorithms are applied to much larger data sets than in this example; the machine works out the most efficient splits of the data so that it can make a prediction in as few steps as possible.

### The decision tree algorithm

The algorithm is a tree-like flowchart with a hierarchical structure. The starting point is the root, which is all of the attributes of all of the pieces of data that the algorithm will evaluate. In the example above, this would be the planetary data for planets, such as composition, number of moons, mass, and gravity.

During the training step, the machine learning algorithm will determine how best to split the data — I’ll explain how later. This split will correspond to a question with a binary true/false outcome. From this initial splitting point, the data branches off into two directions, to other decision nodes — at each of these, the model will split the remaining data further using another binary question.

During training, the machine will continue to partition the data (known as recursive partitioning) until a leaf/terminal node is reached. A leaf/terminal node is the point where the data isn’t split any further and a prediction can be made.
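To make this structure concrete, here is a minimal, hypothetical Python sketch (not code from this course): each internal node asks one binary question, and prediction simply walks from the root to a leaf. The gravity and diameter thresholds are invented for illustration only.

```python
# A minimal, hypothetical decision tree: each internal node asks one
# binary (true/false) question; leaves hold the final prediction.

class Leaf:
    def __init__(self, prediction):
        self.prediction = prediction

class Node:
    def __init__(self, question, if_true, if_false):
        self.question = question    # function returning True/False for a sample
        self.if_true = if_true      # branch followed when the question is True
        self.if_false = if_false    # branch followed when the question is False

def predict(node, sample):
    """Walk the tree from the root down until a leaf is reached."""
    while isinstance(node, Node):
        node = node.if_true if node.question(sample) else node.if_false
    return node.prediction

# Toy tree: "is gravity <= 2.75?", then "is diameter <= 4000 km?"
tree = Node(lambda s: s["gravity"] <= 2.75,
            if_true=Node(lambda s: s["diameter"] <= 4000,
                         if_true=Leaf("Moon"),
                         if_false=Leaf("Planet")),
            if_false=Leaf("Planet"))

print(predict(tree, {"gravity": 9.8, "diameter": 12756}))  # Planet
```

Note that the questions themselves are fixed here; during training, it is the algorithm's job to discover which questions to ask, as described next.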

### Working out the best separation

The main challenge for the machine learning algorithm during the training phase is to identify the attribute that most effectively splits the data at the root and at each decision node until the algorithm reaches a leaf. The two most popular methods are known as information gain and Gini impurity. Both achieve similar results, but in this step, I will focus on information gain. You can optionally read more about Gini impurity at towardsdatascience.com.

### Information gain

The aim of the information gain algorithm is to reduce the amount of information needed in order to make a decision. Information gain itself is a measure of how effectively the data has been split at each decision node. At each decision node, the aim is to achieve as high an information gain as possible: the higher the information gain from a split, the closer the model is to being able to make a prediction.
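One common way to compute information gain (a sketch, not necessarily the exact formulation used in this course) is as the Shannon entropy of the labels before a split minus the weighted entropy of the labels after it:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent node minus the weighted entropy of its children."""
    total = len(parent)
    weighted = sum(len(c) / total * entropy(c) for c in children)
    return entropy(parent) - weighted

# A split that leaves each child as mixed as the parent gains nothing...
mixed = information_gain(["moon", "planet"] * 4,
                         [["moon", "planet"] * 2, ["moon", "planet"] * 2])

# ...whereas a split producing pure children gains the full 1 bit.
pure = information_gain(["moon", "planet"] * 4,
                        [["moon"] * 4, ["planet"] * 4])

print(round(mixed, 3), round(pure, 3))  # 0.0 1.0
```

The two calls mirror the planet/moon example below: an uninformative split scores 0, a perfectly separating split scores 1 bit.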

Imagine that you wanted a machine learning algorithm to classify a celestial body in our solar system as either a moon or a planet. If the split partitions the data by a known gravity value, the information gain would be low: after the split, each of the two nodes still contains a fairly random mix of planets and moons, so the algorithm is no closer to being able to make a prediction. If the split instead partitions the data by whether or not the body has a solid core, the information gain would be higher: although some planets are gas giants, all moons have a solid core, so after the split one node would contain only planets while the other would contain a mix of planets and moons in which the vast majority are moons.

### Knowing when to stop

One disadvantage of a decision tree is the risk of overfitting. If there is one decision path for every single sample within your training data, the model will be overfitted and is highly unlikely to be as accurate when the test data is used. One way to prevent this problem is pruning. Pruning techniques include:

• Setting a maximum number of leaf nodes that can be generated
• Selecting a minimum number of samples needed to produce a new node or leaf
• Setting a maximum depth for the tree (number of times it’s allowed to split the data)

When pruning, you need to find a balance between underfitting and overfitting the data in an attempt to find the sweet spot.
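These pruning controls map directly onto hyperparameters in common libraries. As an illustration, here is how they might look in scikit-learn's `DecisionTreeClassifier` (an assumption: this course does not prescribe a particular library, and the training data below is invented):

```python
# Hypothetical example using scikit-learn; values are illustrative only.
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_leaf_nodes=8,       # cap on the number of leaf nodes generated
    min_samples_split=4,    # minimum samples needed to split a node further
    max_depth=3,            # maximum depth: number of successive splits allowed
)

# Toy data: [gravity (m/s^2), diameter (km)] -> "Moon" or "Planet"
X = [[1.8, 3643], [1.6, 3475], [9.8, 12756], [3.7, 4879]]
y = ["Moon", "Moon", "Planet", "Planet"]
clf.fit(X, y)
print(clf.predict([[24.8, 142984]]))
```

Tightening these limits pushes the model towards underfitting; loosening them pushes it towards overfitting, which is exactly the balance described above.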

### Planets vs moons

The following diagram is a visual representation of the decision tree discussed earlier that classifies a celestial body as either a moon or a planet.

Note: a negative rotation period indicates a celestial body that rotates in the opposite direction to Earth

Data taken from NASA's planetary fact sheet

In this diagram, the root node is attempting to split 13 samples (samples = 13), of which 6 are moons and 7 are planets ([6,7]). There are more planets than moons, which is why class = Planet: at this stage, a sample is more likely to be a planet than a moon. The algorithm has determined that splitting these samples by checking whether gravity is less than or equal to 2.75 will lead to the highest information gain. The next two nodes show the result of this split: every sample that returned False for this condition was a planet, and of those that returned True, 6 were moons and 1 was a planet.
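As a small sketch of just that first node (the gravity ≤ 2.75 test described above), here is the root split in Python; the rest of the tree is left for you to trace in the diagram:

```python
# The root node described above: samples with gravity <= 2.75 follow the
# True branch; everything else follows the False branch.
def root_split(samples):
    true_branch = [s for s in samples if s["gravity"] <= 2.75]
    false_branch = [s for s in samples if s["gravity"] > 2.75]
    return true_branch, false_branch

samples = [{"name": "sample 1", "gravity": 1.8},
           {"name": "sample 2", "gravity": 3.7}]
true_branch, false_branch = root_split(samples)
print([s["name"] for s in true_branch])   # which samples go down the True branch
print([s["name"] for s in false_branch])  # which samples go down the False branch
```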

Use the tree to work out how the algorithm would classify the following two sets of data.

|   | Diameter (km) | Mass (10²⁴ kg) | Density (kg/m³) | Gravity (m/s²) | Rotation period (hours) |
|---|---------------|----------------|-----------------|----------------|-------------------------|
| 1 | 3643          | 89.3           | 3530            | 1.8            | 42.5                    |
| 2 | 4897          | 330            | 5427            | 3.7            | 1407.6                  |