Where is your data?
We have seen that data scientists ask questions in the hope of answering them with data, and that they apply software (or implement algorithms) to process that data. Data is the connection between questions and algorithms.
The data analysis part of data science generally involves applying algorithms (or methods or models) to data in order to investigate questions. All three need to be compatible in order to make any analysis possible, that is, the data must be able to be processed by the algorithm at a level of detail that can provide useful insights into the question.
So far we have looked at asking questions and breaking them down to make them clear and specific. In this step, we start to think more about the what, who, where and how of data.
The Oxford Dictionary of Statistics (Upton and Cook, 2002) defines data simply as ‘information, usually numerical or categorical’. Any observations or information that can be collected and stored can be regarded as data.
Data comes in many forms, such as measurements (heights, for example), counts (number of sales of a particular item), choices (which weblink has been clicked on), relationships (Facebook friends), text (tweets), audio (song tracks), images (satellite images), and video (CCTV). Some data is highly structured (address labels, for example) but a lot of data is unstructured (social media posts).
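The difference between structured and unstructured data can be sketched in a few lines of Python. This is only an illustration: the address labels, the post text and all variable names are made up for the example and do not come from any real dataset.

```python
# Structured data: every record has the same named fields.
# (Hypothetical address labels, invented for this sketch.)
address_labels = [
    {"name": "A. Example", "street": "1 Sample Road", "city": "Coventry"},
    {"name": "B. Example", "street": "2 Sample Road", "city": "Coventry"},
]

# Unstructured data: free text with no fixed fields.
# (Hypothetical social media post, invented for this sketch.)
post = "Great day out at the lake with friends! #sunny"

# Structured data can be queried directly by field name.
cities = [label["city"] for label in address_labels]

# Unstructured data has to be processed before it yields anything
# comparable, for example by pulling out the hashtags.
hashtags = [word for word in post.split() if word.startswith("#")]

print(cities)
print(hashtags)
```

The structured records can be filtered or counted by field straight away, whereas the free text needs some processing step before it can be analysed in the same way.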
There are many ethical and legal issues around data to be aware of. The main piece of legislation is the General Data Protection Regulation (GDPR), which has governed data protection, privacy and consent over personal data across the European Union since 2018. It is important to know who owns or controls the data, what it is permitted to be used for, when it must be anonymised prior to analysis, how it is protected against being leaked and what procedures to follow if it is leaked.
The simplest location to store data is a computer filesystem. The types of data you’d expect to find in a filesystem include text files, spreadsheet files, image files (JPEG, for example), audio files (MP3) and video files (MP4).
Within a computer filesystem, a file could be physically stored on a hard drive or solid-state drive inside the computer, or on a USB flash drive, SD card or external hard drive. For much larger datasets, network or cloud-based filesystems are cheap but will have much slower access speeds.
As data is processed by an algorithm, it is copied in smaller chunks into the computer’s main memory, which consists of physical RAM chips on the computer motherboard. The Central Processing Unit (CPU), the ‘brain’ of every computer, can copy data to and from main memory much faster than it can to and from a hard drive, but main memory is much more expensive.
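At the level of a program, the idea of copying data into memory a chunk at a time might look like the sketch below. It is an illustration only: the chunk size, the function name `count_bytes_in_chunks` and the throwaway file are assumptions made for the example, and the operating system and hardware do their own buffering underneath.

```python
import os
import tempfile

CHUNK_SIZE = 1024  # bytes per chunk; an arbitrary illustrative value


def count_bytes_in_chunks(path, chunk_size=CHUNK_SIZE):
    """Read a file chunk by chunk and return (total_bytes, number_of_chunks)."""
    total = 0
    chunks = 0
    with open(path, "rb") as f:
        while True:
            # Copy at most chunk_size bytes from storage into memory.
            chunk = f.read(chunk_size)
            if not chunk:  # an empty bytes object means end of file
                break
            total += len(chunk)
            chunks += 1
    return total, chunks


# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 2500)
    demo_path = tmp.name

total_bytes, n_chunks = count_bytes_in_chunks(demo_path)
os.remove(demo_path)  # tidy up the temporary file

print(total_bytes, n_chunks)
```

Because only one chunk is held in memory at a time, the same loop works for files far larger than the available RAM.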
There is a trade-off between cost of storage and speed of access to data. Main memory is fastest but most expensive, hard drives are much less expensive but slower, and cloud storage is cheapest but slow due to transfer over the internet.
When the IBM Watson computer played the TV game show Jeopardy in 2011, it stored all its data in main memory ‘because data stored on hard drives would be too slow to compete with human Jeopardy champions’.
We usually think of small, simple datasets as rows and columns in a spreadsheet. When a variety of types of data are involved, the files might be organised using some filename convention or folder structure. For anything larger or more complex, it is best to use a Database Management System (DBMS).
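As a rough sketch of what using a DBMS looks like in practice, the example below uses SQLite through Python’s built-in `sqlite3` module. SQLite is chosen here only because it needs no installation; the table name `sales`, its columns and the rows are all hypothetical, and the database is created in memory purely to keep the example self-contained.

```python
import sqlite3

# Create a temporary database held in memory (nothing is written to disk).
conn = sqlite3.connect(":memory:")

# Define a table: like a spreadsheet, but with declared column types.
conn.execute("CREATE TABLE sales (item TEXT, quantity INTEGER)")

# Insert a few made-up rows.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("apple", 3), ("pear", 5), ("apple", 2)],
)

# Ask the DBMS to total the quantity sold for each item.
rows = conn.execute(
    "SELECT item, SUM(quantity) FROM sales GROUP BY item ORDER BY item"
).fetchall()
conn.close()

print(rows)
```

The query describes what result is wanted and leaves it to the DBMS to work out how to find and combine the rows, which is what makes this approach scale better than a folder of spreadsheets.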
We have seen that data is the connection between our previous topics of questions and algorithms. We have looked at what forms data can take, the legal, security, privacy and ethical issues around who has access to it and for what purpose, where it is stored, and how it might be organised. The next few steps will consider how data is physically stored and how we measure how much data we have.
Many people use social media to share ideas and information. A list of the most popular social networks shows that Facebook, YouTube and WhatsApp each have over 2 billion active users.
Select one of the social media giants and, in the comments area below, write a summary of how the what, who, where and how of data apply to it. Regarding ‘who’, consider the types of data they hold and any ownership, privacy or ethical issues they may have to deal with.
Clement, J. (2020). Global social networks ranked by number of users 2020. Statista. https://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/
Code Exploit Cyber Security. (n.d.). IBM Watson cognitive artificial intelligence. https://www.codexploitcybersecurity.com/2018/04/ibm-watson-artificial-intelligence.html
Information Commissioner’s Office. (2018). Guide to the General Data Protection Regulation (GDPR). Gov.uk. https://www.gov.uk/government/publications/guide-to-the-general-data-protection-regulation
© Coventry University. CC BY-NC 4.0