Classification tree & Employee retention

Let's look at an example of solving a business problem with the classification tree method.

You now know what supervised learning and unsupervised learning are. In supervised learning, we provide a dataset that contains features and a target, so that a model can be trained to minimize the difference between its answer and the target.

Minimizing the difference between the model’s answer and the target is a core part of machine learning. The function we minimize is called the “objective function.” This week, let’s discuss supervised learning a little more. In our lectures, I will tell you about the classification tree and the artificial neural network.
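As a rough illustration of an objective function, the sketch below counts how often a model's answers differ from the targets. The misclassification rate is only one possible choice of objective, and the targets and model answers here are made up for illustration.

```python
def misclassification_rate(predictions, targets):
    """Fraction of examples where the model's answer differs from the target."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    mismatches = sum(1 for p, t in zip(predictions, targets) if p != t)
    return mismatches / len(targets)


# Hypothetical targets (1 = left, 0 = stayed) and answers from two hypothetical models.
targets = [1, 0, 0, 1, 0]
model_a = [1, 0, 1, 1, 0]  # one wrong answer
model_b = [0, 1, 1, 1, 0]  # three wrong answers

rate_a = misclassification_rate(model_a, targets)
rate_b = misclassification_rate(model_b, targets)
print(rate_a, rate_b)
```

Training would prefer the first model, because its value of the objective is lower.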

In the last lecture, I briefly discussed a telecommunication company’s customer churn problem. We all know that it is crucial not to lose customers. But what about your employees?

Employee retention is an organization’s ability to keep its employees. It is usually expressed as a percentage. For example, an annual retention rate of 80% indicates that an organization kept 80% of its employees that year and lost 20%. Employee turnover costs the business money. So how can you make your employees stay with your company for a long time?
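The retention percentage can be worked out with simple arithmetic. The sketch below assumes a hypothetical organization with 200 employees at the start of the year, 40 of whom leave; the function name and the figures are illustrative only.

```python
def retention_rate(employees_at_start, employees_who_left):
    """Percentage of employees the organization kept over the period."""
    if employees_at_start <= 0:
        raise ValueError("employees_at_start must be positive")
    kept = employees_at_start - employees_who_left
    return 100.0 * kept / employees_at_start


# Hypothetical organization: 200 employees at the start of the year, 40 leave.
rate = retention_rate(200, 40)
print(f"Retention: {rate:.0f}%, turnover: {100 - rate:.0f}%")
```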

We can investigate this using data and the classification tree method.

Let me use HR data from Kaggle to explain further. Our dataset looks like this. We have 8 attributes. Remember what an attribute is? We called an attribute an independent variable, or a feature. Attributes are the inputs used to explain our target variable: the information we use to decide the outcome.

Our attributes are the following: employee satisfaction, last term’s HR evaluation score, the number of projects the employee participated in, average working hours, total years spent at the company, whether there was an accident at work, whether there was a promotion in the last five years, and level of salary.

The target variable that we want to know is whether the employee left. The model uses the features as inputs and produces the target variable as its output: whether the employee left.

Before we proceed further, I would like to draw your attention to the different data types. Among the features, accident, promotion, and salary hold a different type of data from the others. When there was an accident, the accident variable has a value of 1. If not, it has a value of 0. These 1 and 0 are labels for “yes” and “no,” not quantities. Promotion works the same way. Salary has three values: low, medium, and high. These are not numerical values either. On the other hand, the other features hold numbers such as 0.34, and there 0.34 really means the quantity 0.34. We can have both types of data in our model.
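One common way to feed both data types to a model is to keep the numbers as they are and turn each category into 0/1 flags. The sketch below is only an illustration of that idea: the field names and the example employee are hypothetical, and one-hot encoding of salary is an assumption rather than something the lecture prescribes.

```python
SALARY_LEVELS = ["low", "medium", "high"]


def encode_employee(record):
    """Turn one employee record into a flat list of numbers for a model."""
    # Numerical features are used as they are.
    row = [record["satisfaction"], record["evaluation"]]
    # Binary labels: 1 means "yes", 0 means "no"; they are flags, not amounts.
    row.append(1 if record["accident"] else 0)
    row.append(1 if record["promotion"] else 0)
    # One-hot encoding: one 0/1 column per salary level.
    row.extend(1 if record["salary"] == level else 0 for level in SALARY_LEVELS)
    return row


# Hypothetical employee record.
employee = {
    "satisfaction": 0.34,
    "evaluation": 0.81,
    "accident": False,
    "promotion": False,
    "salary": "medium",
}
encoded = encode_employee(employee)
print(encoded)
```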

First, let’s imagine we have data on 14999 employees. We will make two groups based on the satisfaction feature. If the satisfaction score is greater than 0.47, the employee belongs to the left group. If the satisfaction score is less than 0.47, the employee belongs to the right group.

Then, for the group with a satisfaction score higher than 0.47, we will use the total time feature. For the group with a satisfaction score of less than 0.47, we will use the number of projects feature instead.
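The first split and the choice of the next feature can be sketched as follows. The employee records and field names are hypothetical, and sending a score of exactly 0.47 to the lower group is an assumption, since the text does not say where that boundary case goes.

```python
THRESHOLD = 0.47

# Feature examined next in each group, as described in the text.
NEXT_FEATURE = {"higher": "total time", "lower": "number of projects"}


def split_by_satisfaction(employees, threshold=THRESHOLD):
    """Divide employees into a higher-satisfaction and a lower-satisfaction group."""
    higher = [e for e in employees if e["satisfaction"] > threshold]
    # Assumption: a score exactly equal to the threshold goes to the lower group.
    lower = [e for e in employees if e["satisfaction"] <= threshold]
    return higher, lower


# Hypothetical employees.
employees = [
    {"name": "A", "satisfaction": 0.82},
    {"name": "B", "satisfaction": 0.34},
    {"name": "C", "satisfaction": 0.51},
]
higher, lower = split_by_satisfaction(employees)
print([e["name"] for e in higher], "->", NEXT_FEATURE["higher"])
print([e["name"] for e in lower], "->", NEXT_FEATURE["lower"])
```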

When we divided the 14999 employees, the first criterion was whether the satisfaction score was higher or lower than 0.47. This gave us 10816 employees in the higher satisfaction group and 4183 employees in the lower satisfaction group. In the first group of 10816 employees, 9776 did not leave the company. In the second group of 4183 employees, 2531 left the company. So the satisfaction score did not cleanly separate the employees who left from those who stayed: about 10% of the first group still left, and about 40% of the second group stayed. The split was not perfect, but it was doing some useful work.
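One way to put a number on "doing some work" is to measure how mixed each group is before and after the split. The sketch below uses Gini impurity, which is an assumption on my part; the lecture does not say which measure the tree uses. The counts come from the text above, with the remaining counts obtained by subtraction.

```python
def gini(stayed, left):
    """Gini impurity of a group: 0.0 when everyone has the same outcome."""
    total = stayed + left
    p_left = left / total
    return 2 * p_left * (1 - p_left)


def weighted_gini(groups):
    """Average impurity of several (stayed, left) groups, weighted by group size."""
    total = sum(s + l for s, l in groups)
    return sum((s + l) / total * gini(s, l) for s, l in groups)


# (stayed, left) counts for each group, taken from the text.
higher = (9776, 10816 - 9776)
lower = (4183 - 2531, 2531)

before = gini(higher[0] + lower[0], higher[1] + lower[1])
after = weighted_gini([higher, lower])
print(f"impurity before split: {before:.3f}, after split: {after:.3f}")
```

A lower value after the split means the two groups are, on average, more homogeneous than the undivided set of employees.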

Next, we will divide the group with a higher satisfaction score again, this time using total time at the company. One group is the people who have spent more than five years there, and the other is those who have spent less than five years. Out of 10816 employees, 8834 spent less than five years and 1982 spent more than five years. Among the 8834 employees who spent less than five years, 8706 stayed at the company. But among the 1982 employees who spent more than five years, only 1070 remained. By using one more criterion, we have divided this group further.
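The counts in this step can be kept in a small nested structure, which also makes it easy to check that the two sub-groups add up to their parent group. This is only an illustrative sketch; the `node` helper and the dictionary layout are hypothetical, while the numbers are the ones given in the text.

```python
def node(rule, total, stayed, children=()):
    """One group of employees in the tree, with its optional sub-groups."""
    return {"rule": rule, "total": total, "stayed": stayed, "children": list(children)}


# The higher-satisfaction group and its two sub-groups.
tree = node("satisfaction > 0.47", 10816, 9776, [
    node("less than five years", 8834, 8706),
    node("more than five years", 1982, 1070),
])


def children_add_up(n):
    """Check that the children of every node account for all of its employees."""
    if not n["children"]:
        return True
    same_total = sum(c["total"] for c in n["children"]) == n["total"]
    same_stayed = sum(c["stayed"] for c in n["children"]) == n["stayed"]
    return same_total and same_stayed and all(children_add_up(c) for c in n["children"])


def stay_rates(n, depth=0):
    """One indented text line per node, showing the share of employees who stayed."""
    lines = [f"{'  ' * depth}{n['rule']}: {n['stayed'] / n['total']:.1%} stayed"]
    for c in n["children"]:
        lines.extend(stay_rates(c, depth + 1))
    return lines


print(children_add_up(tree))
print("\n".join(stay_rates(tree)))
```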

If we keep doing this, we will build a tree that looks like this. We call it a “classification tree.” The final circles of the tree are called leaves. Our goal is to make the leaves as homogeneous as possible. Look at this leaf. You will see that 60 employees belong to it, and none of the 60 left the company. Who are they? Just follow our tree. Their satisfaction score is higher than 0.47, their total time is greater than five years, their evaluation score is higher than 0.81, their average working hours are more than 217 hours, and their total time spent there is, again, more than seven years! If someone meets these criteria, they are unlikely to leave the company.
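Following the tree down to a leaf amounts to checking a list of conditions one after another. A minimal sketch of the path described above, assuming hypothetical field names and two made-up employees:

```python
# Conditions on the path to the leaf, in the order the tree asks them.
# Each entry means "this feature must be greater than this threshold".
LEAF_PATH = [
    ("satisfaction", 0.47),
    ("years_at_company", 5),
    ("evaluation", 0.81),
    ("average_hours", 217),
    ("years_at_company", 7),
]


def reaches_leaf(employee, path=LEAF_PATH):
    """True if the employee exceeds every threshold on the path to the leaf."""
    return all(employee[feature] > threshold for feature, threshold in path)


# Hypothetical employees.
long_timer = {"satisfaction": 0.90, "years_at_company": 8,
              "evaluation": 0.85, "average_hours": 230}
newer_hire = {"satisfaction": 0.90, "years_at_company": 6,
              "evaluation": 0.85, "average_hours": 230}

print(reaches_leaf(long_timer))
print(reaches_leaf(newer_hire))
```

The second employee passes the five-year check but fails the seven-year check, so they end up in a different leaf.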

How do we build this classification tree? It takes many iterations to find which criteria to apply at each step of the tree. Of course, this is taken care of by an algorithm running on a computer.
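To give a feel for what that iteration involves, the sketch below tries every candidate threshold for a single feature and keeps the one that leaves the two groups least mixed. It assumes Gini impurity as the measure and uses a tiny made-up dataset; the lecture does not specify the algorithm, so treat this as one plausible approach, not the method behind the tree shown here.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (1 = left the company)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Try a threshold between each pair of adjacent distinct values and
    return the (threshold, weighted impurity) pair with the lowest impurity."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        above = [l for v, l in zip(values, labels) if v > threshold]
        below = [l for v, l in zip(values, labels) if v <= threshold]
        score = (len(above) * gini(above) + len(below) * gini(below)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical data: satisfaction scores and whether each employee left.
satisfaction = [0.10, 0.20, 0.35, 0.40, 0.60, 0.75, 0.80, 0.90]
left = [1, 1, 1, 0, 0, 0, 0, 0]

threshold, impurity = best_threshold(satisfaction, left)
print(threshold, impurity)
```

A full tree-building algorithm would repeat this search over every feature, pick the best split overall, and then do the same again inside each resulting group.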

This article is from the free online course Artificial Intelligence and Machine Learning for Business.

Created by FutureLearn
