
Feature extraction

What is feature extraction and how can we use it to improve our data model? Find out in this article.

By introducing the concept of dimensionality reduction, we realise the importance of feature selection for data analytics. However, which features are the best to select for our model? That is one of the critical questions to answer when you start working with a dataset.

Starting the analysis with every feature in the dataset can make our work quite challenging and, in addition, high-dimensional data is hard to visualise in a way the human eye can interpret. On the other hand, selecting features based on unfounded assumptions or an arbitrary selection mechanism could make the data scientist’s life even more complex.

Each feature plays an important role when building a model and the selection of the essential features to use is a quite challenging task for every data scientist. As we get to know the data better, we become more familiar with it and we gain a better understanding of the hidden relationships and patterns within it.

An example of feature extraction

Let us consider a simple example to introduce the concept of feature extraction. Assume that we want to model the relationship between a person’s salary and the city they live in, and that we are working with a very large dataset that includes columns such as username, address, mobile phone number, email, LinkedIn account user name and so on. We can quickly see that the person’s mobile phone number is a redundant feature, as it adds no value to the modelling task: it is an identifier (a unique value per user) rather than a feature that can affect our model. Since each dataset includes a variety of features, we can say that as data scientists we are ‘cursed’ with having to handle such high-dimensional situations, where a large number of features may harm our model. On the positive side, a ‘rich’ dataset gives us the opportunity to model the data in several ways.
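As a minimal sketch of the identifier check described above, the Python snippet below flags text columns in which every value is unique and drops them before modelling. The toy records, the column names and the helper identifier_like_columns are all hypothetical and invented for illustration, and the uniqueness rule is only a rough heuristic, not a definitive test.

```python
import pandas as pd

# Hypothetical toy records; names and values are invented for illustration.
people = pd.DataFrame({
    "username": ["ana", "ben", "cat", "dan", "eve", "fay"],
    "mobile": ["0701", "0702", "0703", "0704", "0705", "0706"],
    "city": ["Leeds", "Leeds", "York", "York", "Bath", "Bath"],
    "salary": [31000, 34000, 28000, 29500, 40000, 38500],
})


def identifier_like_columns(df, threshold=1.0):
    """Return non-numeric columns whose share of distinct values reaches threshold."""
    n_rows = len(df)
    if n_rows == 0:
        return []
    flagged = []
    for name in df.select_dtypes(exclude="number").columns:
        # A column with one distinct value per row behaves like an identifier.
        if df[name].nunique() / n_rows >= threshold:
            flagged.append(name)
    return flagged


id_columns = identifier_like_columns(people)
features = people.drop(columns=id_columns)

print(id_columns)
print(list(features.columns))
```

Here username and mobile are flagged, while city is kept because its values repeat across people. Numeric columns are deliberately skipped, since a continuous measurement such as salary can also be unique per row without being an identifier.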

Now, what if we could combine features to make our model even better? One way to do this is to merge highly correlated features into a single feature. For example, a person’s salary may be strongly correlated with their age, so we can create a new feature, salary per age. This new feature is a reduced feature vector that will be the input to our model.
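The salary per age idea can be sketched in a few lines of Python with pandas. The numbers below are made up purely for illustration, and the column name salary_per_age is an assumption; a ratio is only one possible way of merging two correlated columns.

```python
import pandas as pd

# Hypothetical values, invented purely for illustration.
people = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50],
    "salary": [25000, 31000, 34000, 41000, 44000, 52000],
})

# Pearson correlation between the two columns (high for this toy data).
correlation = people["age"].corr(people["salary"])

# Merge the two correlated columns into one derived feature.
people["salary_per_age"] = people["salary"] / people["age"]

# The reduced feature vector: one column instead of two.
model_input = people[["salary_per_age"]]

print(round(correlation, 3))
print(model_input)
```

Checking the correlation first is what justifies the merge: if the two columns were only weakly related, collapsing them into one value would discard information rather than remove redundancy.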

In general, the aim of feature extraction is to reduce the representation of the initial dataset so that a model can use the new feature vector instead of the initial data and perform better, for example by making more accurate predictions. Think carefully: given what you have been taught, why might this be the case?

The next tutorial will introduce you to the concept of data analytics and simple visualisations to identify key features to work with. We will use two datasets from the sklearn datasets library: the California housing dataset and the Boston dataset (the Boston loader, load_boston, was removed in scikit-learn 1.2, so that dataset requires an older version).

We will work directly on the Jupyter Notebook where we can download and import the data and run simple analytics including:

  • Import data and display features to understand the dataset

  • Plot simple histograms to identify distribution of data and promising features

  • Identify features that are correlated and show related patterns
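The three steps above can be sketched roughly as follows. To keep the example runnable offline it uses a small synthetic table rather than the real data: the values are random, not census figures, and the column names only mimic those of the scikit-learn California housing frame (in the notebook the table would typically come from fetch_california_housing(as_frame=True).frame). Histogram counts are computed with NumPy here; plotting them is left to the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data so the sketch runs offline (random, not census data).
rng = np.random.default_rng(0)
n = 500
income = rng.gamma(shape=4.0, scale=1.0, size=n)
frame = pd.DataFrame({
    "MedInc": income,
    "HouseAge": rng.uniform(1, 52, size=n),
    # The target is built to depend on income, plus some noise.
    "MedHouseVal": 0.4 * income + rng.normal(0, 0.3, size=n),
})

# 1. Display the features and their summary statistics.
print(frame.head())
print(frame.describe())

# 2. Histogram of one feature: counts per bin and the bin edges.
counts, bin_edges = np.histogram(frame["MedInc"], bins=10)
print(counts)

# 3. Correlation of every feature with the target, strongest first.
corr_with_target = (
    frame.corr()["MedHouseVal"]
    .drop("MedHouseVal")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

In this synthetic table MedInc ranks first because the target was constructed from it, while HouseAge was generated independently and shows a correlation near zero. On real data the same ranking is a starting point for choosing promising features, not a final answer.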

Your Task

This task involves using the California Housing dataset for visualisations (10 mins)
Study and run the California Housing example in this Jupyter notebook. The famous California housing dataset contains data drawn from the 1990 U.S. Census and reports on the geolocation and features of housing in each census block group, e.g. average number of rooms, median house age, median household income and median property value, for the population of California.
Run each cell and observe which metric is being applied, and the types of visualisation being used to summarise the results.
By the end of the task, you will be familiar with applying common summary statistics, as well as more advanced measures including comparing the correlation between groups of variables, and plotting the results.
Visit the Jupyter Notebook task
© Coventry University. CC BY-NC 4.0
This article is from the free online

Applied Data Science

Created by
FutureLearn - Learning For Life
