Importance of Data Quality

Learn more about the importance of data quality.
The figure identifies six attributes of data quality: accuracy, relevance, accessibility, completeness, clarity, and timeliness.
© Kilkenny, M.F., and Robinson, K.M., 2018. Data quality: “Garbage in–garbage out”

You might feel dizzy after learning so many methods, and you may be tempted to focus on finding the best one to answer your question. However, choosing the most appropriate machine learning method is not the most challenging step.

The first, and most tedious, step is collecting and cleaning the data. You must obtain the correct data, and then filter it with suitable criteria. Only once these steps are done can you apply descriptive or predictive modeling.

The article that we will read today is Kilkenny, M.F., and Robinson, K.M., 2018. Data quality: “Garbage in–garbage out.” It emphasizes the process of data collection. Even if you use an excellent method, the results will be wrong if the data is wrong. After spending many hours fighting with RStudio or any other programming tool, it would be dreadful to realize that the data was wrong.

The result can differ if you change the input data. It is extremely annoying to find that your original answer was wrong because the input data was wrong: you have to redo the whole analysis with the new input data. Worse, the rerun may show that your “good” answer was in fact wrong.
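To make this concrete, here is a minimal Python sketch of how a single wrong input changes a result. The measurements are invented for illustration and do not come from the article; the wrong entry imitates a misplaced decimal point.

```python
from statistics import mean

# Hypothetical measurements, invented for this illustration.
correct_values = [5.5, 6.0, 5.8, 6.2]

# The same data with one entry error: 6.0 was typed as 60.0.
recorded_values = [5.5, 60.0, 5.8, 6.2]

mean_correct = mean(correct_values)
mean_recorded = mean(recorded_values)

print(f"mean of correct data:  {mean_correct:.3f}")
print(f"mean of recorded data: {mean_recorded:.3f}")
```

One bad entry out of four more than triples the average here, and nothing in the calculation itself warns you that the input was wrong.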

Although data collection and cleaning are the most tedious parts of the data mining process, they are the first things to get right for a sound analysis. Yes, they take a lot of time.

The article “Data quality: Garbage in–garbage out” suggests four criteria for useful data: accuracy, validity, completeness, and availability. Accuracy is the most important one: if your data is accurate, you are at a good starting point. Problems with accuracy include both wrong and missing values. For example, a record that says 5 when the true value is 5.5 is wrong. You also do not want too much missing data. Finally, consider when the data was recorded or released. We often work with timestamped data, and we want that time information to be exact.
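Some of these checks can be automated before any modeling starts. The following is a minimal Python sketch, not a method from the article. The field names `value` and `timestamp`, the valid range of 0 to 10, and the sample records are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical records; the field names "value" and "timestamp" are assumptions.
records = [
    {"value": 5.5, "timestamp": "2018-03-01T09:00:00"},
    {"value": None, "timestamp": "2018-03-01T10:00:00"},  # missing value
    {"value": -1.0, "timestamp": "2018-03-01T11:00:00"},  # outside the assumed range
    {"value": 5.0, "timestamp": "not recorded"},          # unusable timestamp
]


def is_valid_timestamp(text):
    """Return True if text parses as an ISO 8601 date-time string."""
    try:
        datetime.fromisoformat(text)
        return True
    except (TypeError, ValueError):
        return False


def quality_report(rows, low=0.0, high=10.0):
    """Count simple data quality problems in a list of dict records.

    The bounds low and high are an assumed valid range for "value".
    """
    missing = sum(1 for r in rows if r.get("value") is None)
    out_of_range = sum(
        1 for r in rows
        if r.get("value") is not None and not (low <= r["value"] <= high)
    )
    bad_time = sum(1 for r in rows if not is_valid_timestamp(r.get("timestamp")))
    return {
        "rows": len(rows),
        "missing_value": missing,
        "out_of_range_value": out_of_range,
        "bad_timestamp": bad_time,
    }


report = quality_report(records)
print(report)
```

A check like this only catches values that are missing or implausible. A plausible but wrong value, such as 5 recorded instead of 5.5, passes the range test and can only be found by comparing against a trusted source.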

Remember the saying “Garbage in, garbage out.” Collecting the right data is demanding, but it must be done correctly. Otherwise, there is no reason to do any analysis.

This article is from the free online

Artificial Intelligence and Machine Learning for Business

Created by
FutureLearn - Learning For Life
