Why is big data challenging to analyse?
Now that you know the different characteristics of big data and the many sources it can come from, you can probably imagine that analysing big data is not straightforward.
Most big data is collected for commercial purposes: to increase profits, provide better services, or gain a competitive advantage. Organisations are therefore hesitant to share their data with outsiders. Even when they do allow access, they usually restrict it to certain portions of the data or impose rate limits on how much data can be accessed per day or per user. This makes it difficult for researchers and non-profit organisations to obtain data, and also for organisations to integrate their own data with that of other organisations. However, many countries now promote ‘open data’ portals, where datasets are made available to the public.
Inconsistent and incomplete data
Even though we are collecting more data than ever before, the overall quality of our data has not increased. The percentage of incorrect or incomplete data points remains the same, so the absolute number of bad data points grows with the volume. For example, take an electronic sensor that records one incorrect reading in every 1,000 readings. If the number of readings increases tenfold to 10,000, the number of incorrect readings also increases tenfold, from one to ten. Written text on the web will also always include spelling mistakes. As the amount of text posted increases, it may even contain a higher percentage of mistakes. Data cleaning therefore becomes an important task in big data analytics.
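The sensor example can be worked through in a few lines of Python. This is only a minimal sketch: the function names, the temperature readings, and the plausible range of -50 to 60 degrees Celsius are assumptions made for illustration, and real data cleaning usually involves many more checks.

```python
from typing import Optional

ERROR_RATE = 1 / 1000  # one incorrect reading per 1,000, as in the sensor example


def expected_errors(n_readings: int, error_rate: float = ERROR_RATE) -> float:
    """Expected number of incorrect readings at a constant error rate."""
    return n_readings * error_rate


def clean(readings: list[Optional[float]], low: float, high: float) -> list[float]:
    """Drop missing values and values outside a plausible range."""
    return [r for r in readings if r is not None and low <= r <= high]


# The error rate stays the same, but the number of errors grows with volume.
for n in (1_000, 10_000, 1_000_000):
    print(f"{n:>9,} readings -> about {expected_errors(n):,.0f} incorrect")

# Hypothetical temperature readings in degrees Celsius; None marks a missing value.
raw = [21.4, 21.6, None, 999.0, 21.5, -273.9]
cleaned = clean(raw, low=-50.0, high=60.0)
print(cleaned)
```

A simple range check like this only catches obviously impossible values. A reading that is wrong but still plausible would pass through unnoticed, which is one reason why data cleaning takes so much effort.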
Heterogeneity of data
Heterogeneity of data refers to how much the data differs across the dataset we are looking at. This can include differences in data format, number of missing values, level of detail, or length of time period for which data is available.
Heterogeneity is a particular issue when we bring together data from unconnected sources. For example, it may be useful to connect population data from government sources with data from environmental sensors to decide on actions for a city's drinking water management plan. The data from these different sources needs to be matched carefully to ensure that the analysis results are valid.
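To give a feel for what this matching can involve, here is a small Python sketch. All of the data, field names, and suburb names are made up for illustration. It assumes the two sources differ in naming format, level of detail (yearly versus hourly), and completeness, and that they can be matched on suburb and year.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical government population data: one row per suburb per year.
population = [
    {"suburb": "Northside", "year": 2023, "residents": 12000},
    {"suburb": "Riverbank", "year": 2023, "residents": 8500},
]

# Hypothetical sensor data: hourly readings, inconsistent site names,
# and a missing value.
sensor_readings = [
    {"site": "northside ", "timestamp": "2023-03-01T08:00", "litres_per_min": 410.0},
    {"site": "NORTHSIDE", "timestamp": "2023-03-01T09:00", "litres_per_min": None},
    {"site": "Riverbank", "timestamp": "2023-03-01T08:00", "litres_per_min": 260.0},
    {"site": "riverbank", "timestamp": "2023-03-01T09:00", "litres_per_min": 280.0},
    {"site": "Hilltop", "timestamp": "2023-03-01T08:00", "litres_per_min": 150.0},
]


def normalise(name: str) -> str:
    """Bring place names into one format so they can be compared."""
    return name.strip().lower()


# Step 1: aggregate the sensor readings to the level of detail of the
# population data (suburb and year), skipping missing values.
flows = defaultdict(list)
for row in sensor_readings:
    if row["litres_per_min"] is not None:
        key = (normalise(row["site"]), int(row["timestamp"][:4]))
        flows[key].append(row["litres_per_min"])

# Step 2: match the two sources on the shared key.
matched = []
population_keys = set()
for row in population:
    key = (normalise(row["suburb"]), row["year"])
    population_keys.add(key)
    if key in flows:
        matched.append({
            "suburb": row["suburb"],
            "year": row["year"],
            "residents": row["residents"],
            "mean_litres_per_min": mean(flows[key]),
        })

# Step 3: report sensor sites that could not be matched, rather than
# silently dropping them.
unmatched_sensor_sites = sorted(k[0] for k in flows if k not in population_keys)

print(matched)
print("Sensor sites without population data:", unmatched_sensor_sites)
```

Listing the unmatched sites is a deliberate choice in this sketch: records that fail to match are often a sign of a formatting difference or a gap in one of the sources, and ignoring them can bias the results.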
Data privacy and protection
More and more data is stored about personal interests, behaviours, and attitudes. While consumers often trade their personal data for a product customised to their liking, their privacy needs to be protected by clear policies. In addition, the results of analysing personal data, perhaps from multiple sources, may be more sensitive than the individual parts taken separately. As the saying attributed to Aristotle goes: ‘The whole is greater than the sum of its parts’ [3].
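The following Python sketch illustrates how combining sources can make data more sensitive. Both datasets, and every name and value in them, are invented for this example. It assumes that the two sources happen to share the attributes postcode and birth year.

```python
# Hypothetical "anonymised" survey: no names, but postcode and birth year remain.
survey = [
    {"postcode": "4111", "birth_year": 1984, "health_condition": "asthma"},
    {"postcode": "4222", "birth_year": 1990, "health_condition": "none"},
]

# Hypothetical public membership list: names, but nothing sensitive.
members = [
    {"name": "A. Example", "postcode": "4111", "birth_year": 1984},
    {"name": "B. Sample", "postcode": "4222", "birth_year": 1975},
]

# Linking on the shared attributes attaches a name to a sensitive value.
linked = [
    {"name": m["name"], "health_condition": s["health_condition"]}
    for s in survey
    for m in members
    if (s["postcode"], s["birth_year"]) == (m["postcode"], m["birth_year"])
]
print(linked)
```

Neither dataset reveals much on its own, but the linked result connects a named person to a health condition. This is why privacy policies need to consider how datasets might be combined, not just what each one contains.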
Data privacy and protection are not just important for individuals. Organisations also need to have their data and intellectual property protected by policies and laws.
In the next step, we will look at an overview of the data analytics cycle, which is used to solve a problem or gain new insights from data.
Can you actually find data from any of the big data sources you (or other learners) identified in the previous step?
Share your thoughts in the comments.
1. Batrinca B, Treleaven PC. Social media analytics: a survey of techniques, tools and platforms. AI & Society 2015; 30(1): 89-116.
2. EMC Education Services. Data Science and Big Data Analytics. Wiley; 2015.
3. Goodreads. Aristotle > quotes > quotable quote [Internet]. Available from: https://www.goodreads.com/quotes/20103-the-whole-is-greater-than-the-sum-of-its-parts
© Griffith University