
The standard process for data mining

What is the standard process for data mining? In this article, Dr Ming Yan discusses his recent research.

A generic data mining process flow can be summarized in the following steps.

Data Acquisition

The collection of big data means receiving data from clients, over which users can then run simple queries and processing. The main challenge in big data collection is high concurrency: thousands of users may be accessing and operating on the system at the same time, so a large number of databases must be deployed on the collection side to support the load. Representative tools include Flume and Kafka.
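
As a minimal sketch of the collection side, the snippet below publishes an event to a Kafka topic using the kafka-python client. The broker address, topic name, and event fields are illustrative assumptions, not details from the article.

```python
# A minimal sketch of big data collection with Kafka (kafka-python client).
# Broker address, topic name, and event fields are illustrative assumptions.
import json
from kafka import KafkaProducer

# Connect to an assumed local Kafka broker; serialize events as JSON.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a user event to a hypothetical "user_events" topic.
producer.send("user_events", {"user_id": 42, "action": "click", "ts": 1700000000})

# Block until all buffered messages are actually delivered.
producer.flush()
```

In practice, many concurrent clients write to the same topic and Kafka buffers the stream until downstream storage and computation layers consume it.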

Data Storage

With the emergence and rapid development of the Internet, coupled with the large-scale use of digital devices, today’s data is mainly generated automatically by devices, servers, and applications. Machine-generated data, such as genetic data, user behavior data, location data, images, videos, weather, earthquake, and medical data, is growing geometrically. In recent years, the technology of storing and analyzing big data by extending and encapsulating Hadoop has become more and more mature. Representative tools include the HDFS file system and the HBase column-oriented database.
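
As a hedged sketch of the storage step, the snippet below writes a small machine-generated record into HDFS using the third-party HdfsCLI Python package over the WebHDFS interface. The NameNode URL, user name, and file paths are assumptions for illustration.

```python
# A minimal sketch of writing to and listing HDFS, using the third-party
# HdfsCLI package (pip install hdfs) over WebHDFS. The NameNode URL,
# user name, and paths are illustrative assumptions.
from hdfs import InsecureClient

# Connect to an assumed NameNode exposing WebHDFS on port 9870.
client = InsecureClient("http://namenode:9870", user="hadoop")

# Write a small machine-generated record into an assumed directory.
client.write("/data/raw/sensor_001.csv",
             data="device_id,temp_c\nsensor_001,21.4\n",
             overwrite=True)

# List the directory to confirm the file landed.
print(client.list("/data/raw"))
```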

Extract, Transform, Load (ETL)

To analyze massive data effectively, data collected at the front end must be imported into a centralized large-scale database or a distributed storage cluster, with simple cleaning and preprocessing carried out along the way. The main challenge of extract, transform, load in the big data era is the sheer volume of data being imported; once computation, analysis, and mining are complete, the results must also be visualized or passed on to other business systems. Typical ETL tools include Sqoop and DataX, which meet the data cleansing, export, and import needs of different platforms.
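
As a hedged sketch of an import job, the snippet below drives a Sqoop import from Python via the command line. The JDBC URL, credentials, table name, and target directory are illustrative assumptions, not values from the article.

```python
# A hedged sketch of driving a Sqoop import from Python via the command
# line. The JDBC URL, credentials, table, and target directory are
# illustrative assumptions.
import subprocess

cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",   # assumed source database
    "--username", "etl_user",
    "--password", "etl_pass",
    "--table", "orders",                             # assumed source table
    "--target-dir", "/data/staging/orders",          # assumed HDFS landing dir
    "--num-mappers", "4",                            # parallel import tasks
]

# Run the import and fail loudly if Sqoop returns a non-zero exit code.
subprocess.run(cmd, check=True)
```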

Data Computation

Big data computing is mainly reflected in the rapid statistics and analysis of data. Statistics and analysis mainly use distributed databases or distributed computing clusters to perform ordinary analysis and classification of the massive data stored in them, meeting the most common analysis needs. Common tools include the MapReduce distributed parallel computing framework, the Spark in-memory computing model, and the Impala interactive query and analysis framework for big data.
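
As a minimal sketch of this statistics-and-analysis step, the PySpark snippet below groups and counts event records on a cluster. The input path and column names are illustrative assumptions.

```python
# A minimal PySpark sketch of rapid statistics over massive event data.
# The input path and column names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("event-stats").getOrCreate()

# Read assumed CSV event data from the distributed store.
events = spark.read.csv("hdfs:///data/staging/events.csv",
                        header=True, inferSchema=True)

# Ordinary analysis and classification: count events per action type.
stats = events.groupBy("action").agg(F.count("*").alias("n"))
stats.show()

spark.stop()
```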

Data Analysis and Mining

Data mining on big data differs in several ways from traditional data mining methods. First, on a big data platform the volume of data places higher demands on the timeliness of mining. Second, the volume and diversity of the data lower the absolute accuracy requirements on a model; better results can be achieved by improving relative accuracy while processing the data. Finally, data mining on a big data platform can proceed without a predetermined theme, running various algorithms over the existing data to produce predictions and so meet higher-level analysis needs. Commonly used tools include data mining and machine learning libraries such as Mahout and MLlib.
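
As a hedged sketch of mining without a predetermined theme, the snippet below runs unsupervised k-means clustering over user behavior features with Spark MLlib. The input path, column names, and the choice of k are illustrative assumptions.

```python
# A hedged sketch of unsupervised mining with Spark MLlib: k-means
# clustering over user behavior features. The input path, column
# names, and k are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("behavior-clusters").getOrCreate()

# Read an assumed feature table of per-user behavior statistics.
df = spark.read.parquet("hdfs:///data/features/users.parquet")

# Assemble assumed numeric columns into the feature vector MLlib expects.
assembler = VectorAssembler(inputCols=["visits", "clicks", "spend"],
                            outputCol="features")
model = KMeans(k=5, seed=1).fit(assembler.transform(df))

# Cluster centers summarize the discovered user segments.
for center in model.clusterCenters():
    print(center)

spark.stop()
```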

Data Visualization

For data analysis, the most difficult part is data presentation: interpreting the relationships between data and conveying data information clearly and effectively. Big data visual analysis aims to combine the automated analysis capabilities of computers with the cognitive strengths of human visual perception, organically integrating the respective strengths of people and machines. With the help of human-computer interactive analysis methods and interaction technologies, it helps people gain more intuitive and efficient insight into the information, knowledge, and wisdom behind big data.

In the era of big data, data comes from many sources, most of them heterogeneous environments. Even when the data sources are accessible, the completeness, consistency, and accuracy of the data obtained are difficult to guarantee, and this uncertainty in data quality directly affects how sound and accurate visual analysis can be. Data visualization has been integrated into the whole process of big data analysis and processing, gradually forming a theory of big data visual analysis that is grounded in data characteristics, oriented to the data processing workflow, and targeted at the results of data analysis. A minimal presentation example follows below.
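
As a minimal sketch of the presentation step, the snippet below plots aggregated results with Matplotlib. The category names and counts are made-up values standing in for the output of the computation step.

```python
# A minimal sketch of presenting aggregated results with Matplotlib.
# The category names and counts are made up for illustration only.
import matplotlib.pyplot as plt

# Assumed output of the computation step: event counts per action type.
actions = ["view", "click", "purchase"]
counts = [12000, 3400, 560]

fig, ax = plt.subplots()
ax.bar(actions, counts)
ax.set_xlabel("Action")
ax.set_ylabel("Event count")
ax.set_title("Events per action type")
plt.show()
```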

Your task

How many steps are there in the standard data mining process?

Share your thoughts and ideas in the comments below.

© Communication University of China
This article is from the free online course Introduction to Digital Media.
