Big data, wicked problems
As computer hardware gets cheaper and faster, we can afford to collect and store more and more data. Does this mean that it will get easier to answer all kinds of questions by using data?
Have you noticed that each generation of digital devices (phone, camera, tablet or laptop) seems to take a great leap forward in speed and capacity over the previous generation? In computer science, the well-known Moore’s Law predicts that the speed and storage capacity of new devices double approximately every two years. Despite these improvements, significant challenges remain in what can be computed, alongside new opportunities.
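As a rough illustration of what repeated doubling implies (the starting figure is invented, and real devices do not follow the law exactly), the projected capacity after a number of years can be computed directly:

```python
def projected_capacity(base_capacity_gb, years, doubling_period=2):
    """Project storage capacity assuming a Moore's-Law-style doubling
    every `doubling_period` years (an idealisation, not a guarantee)."""
    return base_capacity_gb * 2 ** (years / doubling_period)

# A hypothetical 256 GB device, projected ten years ahead:
print(projected_capacity(256, 10))  # 5 doublings: 256 * 32 = 8192.0 GB
```

Ten years is five doubling periods, so capacity grows by a factor of 2⁵ = 32.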
The term ‘Big Data’ simply refers to data that is either high volume (there is a lot of it), high velocity (it moves very quickly), high variety (some of it might be structured but much of it is unstructured) or some combination of these. These are the ‘three Vs’ of Big Data, first described by Laney (2001). All of these aspects pose challenges in the analysis of data.
High volume means that large amounts of data are usually stored ‘in the cloud’ rather than on local computers. Physically, ‘the cloud’ is composed of a number of data centres, i.e. large buildings containing networked computer servers, each with a lot of data storage. The China Telecom Data Centre in Hohhot, for example, has one million square metres of floor space. To process large amounts of data distributed across several data centres, most of the processing must happen where the data is, rather than copying the data across the internet.
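The idea of moving computation to the data, rather than data to the computation, can be sketched in miniature: each ‘data centre’ below summarises its own shard locally, and only the small summaries travel. (The shards and log values are invented for illustration.)

```python
from collections import Counter

# Hypothetical shards of log entries held in three separate data centres.
shards = [
    ["error", "ok", "ok"],
    ["ok", "error", "error", "ok"],
    ["ok"],
]

def local_summary(shard):
    # Runs where the data lives; only this small Counter crosses the network.
    return Counter(shard)

# Combine the compact per-centre summaries centrally.
total = sum((local_summary(s) for s in shards), Counter())
print(total["ok"], total["error"])  # 5 3
```

Only a handful of counts move between machines, however large each shard grows.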
High velocity means it may not be possible to store all the data arriving from a live stream. For example, it is unlikely to be feasible to process every frame from a live CCTV camera to detect faces, so only a small proportion of frames would be stored and processed.
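One standard way to keep a manageable sample of an unbounded stream is reservoir sampling, which holds a fixed-size uniform sample in memory however long the stream runs. The sketch below uses frame numbers as stand-ins for video frames; it is illustrative, not a real CCTV pipeline:

```python
import random

def sample_stream(stream, k, seed=0):
    """Reservoir sampling: keep a uniform random sample of k items from a
    stream of unknown total length, using only O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

frames = range(1_000_000)          # stand-in for a stream of video frames
kept = sample_stream(frames, 10)   # only 10 frames ever held in memory
print(len(kept))  # 10
```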
High variety means that a lot of data is not clearly structured. For example, tweets from Twitter contain natural language, often in abbreviated form, and images are composed of a large number of pixels.
Big data also presents opportunities
More data means that we have more information on which to base conclusions and subsequent decisions.
A lot of data is collected about individuals and can be used to draw conclusions about them. For example, the ads displayed on webpages you visit can be chosen based on your click history.
Recommendation systems gather reviews from customers and aggregate these reviews to rate things like products, restaurants and holiday venues.
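A minimal sketch of this kind of aggregation (the venues and scores below are invented) is simply a per-item average of the collected review scores:

```python
from statistics import mean

# Hypothetical review scores gathered from customers for each venue.
reviews = {
    "Cafe Aroma": [5, 4, 4, 5],
    "Pizza Plaza": [3, 2, 4],
}

# Aggregate each venue's reviews into a single rating.
ratings = {name: mean(scores) for name, scores in reviews.items()}
print(ratings["Cafe Aroma"])  # 4.5
```

Real recommendation systems go further (weighting recent reviews, detecting fake ones, personalising results), but the core step is this aggregation.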
Social media contains data indicating networks of links between people and organisations. For example, information about friends could be used for credit scoring.
Data from many sources can be integrated. For example, in healthcare, information from your GP medical records could be combined with activity monitoring from your smartwatch to monitor your health.
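In code, integration often amounts to joining records from different sources on a shared key. The sketch below (the patient IDs and fields are invented) merges two hypothetical record sets per patient:

```python
# Hypothetical records from two sources, keyed on a shared patient ID.
gp_records = {"p01": {"age": 52, "condition": "hypertension"}}
watch_data = {"p01": {"avg_daily_steps": 6400, "resting_hr": 71}}

# Integrate: merge the fields from both sources for each patient.
integrated = {
    pid: {**gp_records[pid], **watch_data.get(pid, {})}
    for pid in gp_records
}
print(integrated["p01"]["resting_hr"])  # 71
```

In practice the hard part is matching keys reliably across sources and doing so lawfully; the join itself is the easy step.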
Even with the ability to store and process large amounts of data and continuous improvements in computer processing power, there will always be questions that are easy to write down but incredibly difficult – sometimes practically impossible – to solve.
The BlueGene/L supercomputer at the Lawrence Livermore National Laboratory in Livermore, California. Credit: Kim Kulish / Contributor / Getty Images
Modern cryptography systems rely on the difficulty of factorising very large numbers into a product of prime numbers.
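Trial division shows why factorisation is hard: the number of candidate divisors grows with the square root of the number, which is astronomically large for the hundreds-of-digits numbers used in cryptography. A naive sketch:

```python
def factorise(n):
    """Naive trial division: fine for small n, hopeless for the
    hundreds-of-digits numbers used in modern cryptography."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

print(factorise(2021))  # [43, 47]
```

Multiplying 43 by 47 is instant; recovering 43 and 47 from 2021 already requires a search, and the gap between the two directions widens dramatically as the numbers grow.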
Many optimisation problems, such as building university timetables, hospital staff rosters and routes for delivery vehicles, can take a computer a very long time to solve well, even though only a small amount of data is involved.
In both cases, a huge number of possible combinations must be tried and tested to arrive at an answer, and finding one in a reasonable amount of time can be practically impossible.
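The growth is easy to demonstrate: for a delivery route visiting n stops, every ordering of the stops is a candidate route, so a brute-force search must consider n! (n factorial) possibilities. A few lines of Python (the stop names are invented) make the explosion visible:

```python
from math import factorial
from itertools import permutations

# Brute force: every ordering of the stops is a candidate delivery route.
stops = ["A", "B", "C", "D"]
routes = list(permutations(stops))
print(len(routes))  # 4! = 24

# The search space grows factorially with the number of stops:
for n in (5, 10, 20):
    print(n, factorial(n))
```

At 20 stops there are already more than 2.4 quintillion orderings, which is why practical solvers rely on heuristics rather than exhaustive search.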
We have seen that big data presents both challenges and opportunities for data science, but there will always be extremely difficult questions that lots of data alone may not be able to answer in a reasonable amount of time.
Identify one challenge and/or one opportunity in applying big data in the NHS and share it with your fellow students in the comments area.
Agrawal, R. & Prabakaran, S. (2020). Big data in digital healthcare: lessons learnt and recommendations for general practice. Heredity, 124(4), 525–534. https://doi.org/10.1038/s41437-020-0303-2
Laney, D. (2001). 3D Data management: Controlling data volume, velocity, and variety. META Group.
Moore’s Law (n.d.). Moore’s Law: How overall processing power for computers will double every two years. http://www.mooreslaw.org/
© Coventry University. CC BY-NC 4.0