CRISP-DM Example

An article showing an example of what using the CRISP-DM process might look like.
© Luleå University of Technology

In this article, we walk through an example of using the CRISP-DM process with a dataset that we download and load into Orange. Please refer to the article on CRISP-DM to refresh your knowledge if needed. The six phases are explained in the example below. If you are unable to perform the exercise yourself, please follow the process below as best you can.

Business understanding

In this example, we use household energy consumption as a lens through which to look at climate change. Households have been encouraged to reduce their energy use to help mitigate climate change.

In this particular example, an energy company plans to introduce new pricing schemes for individual household electric power consumption. It needs an estimate of the clusters it can form from a historical database of individual consumption. Once the clusters are defined, the company will prepare one plan for each identified consumption group.

Data understanding

The dataset was found at https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption in the UCI machine learning repository. It describes the electric power consumption of an individual household. The archive contains a little more than 2 million records: precisely 2,075,259 measurements gathered in a house located in Sceaux (7 km from Paris, France) between December 2006 and November 2010 (47 months). All calendar timestamps are present in the dataset, but for some timestamps the measurement values are missing: a missing value is represented by the absence of a value between two consecutive semicolon attribute separators. For instance, the dataset shows missing values on April 28, 2007.
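The missing-value convention described above can be illustrated with a minimal sketch. It uses only the Python standard library and a three-row excerpt whose measurement values are made up for illustration; only the column names and the semicolon layout come from the dataset description. Treating "?" as a second missing marker is an assumption about how some copies of the file mark gaps.

```python
import csv
import io

# Hypothetical three-row excerpt in the dataset's semicolon-separated layout.
# The values are made up; the middle row has nothing between the separators.
SAMPLE = (
    "Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity\n"
    "28/4/2007;00:20:00;1.5;0.1;240.0;6.2\n"
    "28/4/2007;00:21:00;;;;\n"
    "28/4/2007;00:22:00;1.4;0.0;239.5;5.8\n"
)

# "" is the convention described in the text; "?" is an assumed extra marker.
MISSING_MARKERS = {"", "?"}


def parse_value(text):
    """Return a float, or None when the field is empty or a missing marker."""
    text = text.strip()
    return None if text in MISSING_MARKERS else float(text)


def read_measurements(handle):
    """Read rows into dicts: timestamps stay text, measurements become floats."""
    rows = []
    for record in csv.DictReader(handle, delimiter=";"):
        row = {"Date": record["Date"], "Time": record["Time"]}
        for name, value in record.items():
            if name not in ("Date", "Time"):
                row[name] = parse_value(value)
        rows.append(row)
    return rows


rows = read_measurements(io.StringIO(SAMPLE))
missing_rows = [r for r in rows if r["Global_active_power"] is None]
print(len(rows), "rows,", len(missing_rows), "with missing measurements")
```

Keeping missing measurements as `None` rather than dropping the rows preserves every timestamp, which matches the dataset's statement that all calendar timestamps are present.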

The selected features are:

  • Global_active_power: household global minute-averaged active power (in kilowatt)
  • Global_reactive_power: household global minute-averaged reactive power (in kilowatt)
  • Voltage: minute-averaged voltage (in volts)
  • Global_intensity: household global minute-averaged current intensity (in ampere)

Data pre-processing

The dataset contains missing values in the measurements, in approximately 1% of the rows, which we have to fix. We therefore replace each missing value with the most frequent value of its column. We also need to make the ranges of the features consistent, so we normalize the values so that all of them lie in the range 0 to 1.
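The two pre-processing operations can be sketched in a few lines of plain Python. This is an illustration of the idea on a made-up list of voltage readings, not a description of how Orange implements its Preprocess widget; the function names are hypothetical.

```python
from collections import Counter


def impute_most_frequent(values):
    """Replace None entries with the most common observed value."""
    observed = [v for v in values if v is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column: nothing to rescale
    return [(v - low) / (high - low) for v in values]


voltage = [240.0, None, 238.0, 240.0, 242.0]  # made-up readings
filled = impute_most_frequent(voltage)
scaled = min_max_normalize(filled)
print(filled)  # [240.0, 240.0, 238.0, 240.0, 242.0]
print(scaled)  # [0.5, 0.5, 0.0, 0.5, 1.0]
```

Normalizing matters for clustering because k-means relies on distances: without it, a feature measured in volts (values around 240) would dominate a feature measured in kilowatts (values around 1).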

Model

Orange is configured to build the process that produces the k-means clustering.
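To make clear what k-means does behind the widget, here is a minimal sketch in plain Python. It is not Orange's implementation. For reproducibility it swaps the usual random or k-means++ initialization for a deterministic farthest-point initialization, which is an assumption of this sketch, and the six two-dimensional points are made up.

```python
import math


def farthest_point_init(points, k):
    """Pick the first point, then repeatedly the point farthest from all picks."""
    centroids = [points[0]]
    while len(centroids) < k:
        next_point = max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids)
        )
        centroids.append(next_point)
    return centroids


def kmeans(points, k, max_iterations=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Made-up normalized points forming two obvious groups.
points = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (0.9, 1.0), (1.0, 0.9), (0.95, 0.95)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The result of k-means depends on where the centroids start, which is why tools such as Orange typically rerun the algorithm several times and keep the best run.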

Evaluation

We need to be able to evaluate the quality of the clusters. We therefore use a metric called the silhouette score.
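The silhouette of a point compares a, its mean distance to the other points in its own cluster, with b, its mean distance to the points of the nearest other cluster: s = (b - a) / max(a, b). The score for a clustering is the mean of s over all points, ranging from -1 (poor) to 1 (well separated). The sketch below computes it in plain Python on made-up points; the function name is hypothetical and the code is an illustration, not Orange's implementation.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other members (distance to itself is 0).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 0.1), (1.0, 1.0), (1.0, 0.9)]  # made-up points
good = silhouette_score(points, [0, 0, 1, 1])  # groups match the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # groups cut across the geometry
print(round(good, 3), round(bad, 3))
```

A labeling that follows the natural groups scores close to 1, while one that mixes them scores below 0, which is what makes the metric useful for comparing different numbers of clusters.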

Deployment

The output could be used for several purposes. For example, segmentation makes it easier to manage the different clusters with a differentiated pricing policy.

Steps to analyze the dataset

  • Open Orange on your computer
  • Download the data. You can access the data set here.
  • You will need to add six widgets to the analysis area: CSV File Import, Data Sampler, Select Columns, Preprocess, K-means, and Scatter Plot. The widgets should be connected as in the figure below:

Explaining the widgets

CSV File Import: here you point the widget to where you stored the downloaded dataset on your computer.

CSV file import

Data Sampler: since the dataset is big (more than 2 million rows), it is wise to analyze only a sample to save computational resources. You can also try the same analysis without the sampler and compare the results.
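The idea behind sampling can be sketched as follows. The 10% fraction and the fixed seed are assumptions chosen for illustration, not settings prescribed by this exercise, and a list of integers stands in for the dataset rows.

```python
import random


def sample_rows(rows, fraction, seed=42):
    """Draw a reproducible random sample without replacement."""
    rng = random.Random(seed)  # fixed seed so reruns give the same sample
    size = max(1, round(len(rows) * fraction))
    return rng.sample(rows, size)


rows = list(range(1000))  # stand-in for the dataset rows
subset = sample_rows(rows, fraction=0.10)
print(len(subset))  # 100
```

Fixing the seed makes the comparison fair when you change other settings later, because every run then clusters the same rows.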

Data sampler

Select Columns: part of data understanding is focusing only on the columns of interest, and that is what the Select Columns widget lets us do.

Select columns

Preprocess: the dataset is not ready yet. We need to replace missing values with the most frequent values and normalize the data, as explained above.

Preprocess

K-means: here we run the k-means clustering algorithm on the dataset after we have prepared it.

Scatter Plot: here we visualize the output of the k-means clustering algorithm. The two widgets, K-means and Scatter Plot, can be opened side by side.

Share your thoughts

Now, we want you to change the number of clusters (k) to 2, 3, 4, etc., compare the Silhouette scores, which measure the quality of the clusters, and see the impact on the scatter plot.

  • Which value for k do you think is best?
  • Why?

Click on “File”, choose “Save as” and name it “task4”.

This article is from the free online

Data Science for Climate Change

Created by
FutureLearn - Learning For Life