The four V's of data
Let’s look at some key considerations when storing large volumes of data, and the challenges that come with them.
Volume (petabytes): as Victoria explains in her video, CEDA currently have 17 petabytes (PB) of storage, which is being added to all the time (revisit the infographic in Step 1.4 if you need a reminder of the different byte sizes). Data volumes are predicted to grow, so the CEDA team is trialling new storage systems that move data to tape rather than fast, expensive disk. The trade-off is access time: a tape may not be immediately accessible and has to be retrieved by a robot, and finding the right place on the tape for the required data can take longer than on disk. Anyone who has used tape (e.g. when music or video was on tape!) will appreciate some of the issues.
Variety: there are hundreds of datasets (nearly 500 collections) and more than 150 million files, with different suppliers, formats and uses. Topics for data held at CEDA range from bird habitats to archaeology. Much of CEDA’s work is about making sure the data can be discovered, understood and used in years to come, without needing to contact the original data supplier.
Velocity: the fastest incoming dataset today comes from the Copernicus Sentinel satellites, which bring in over 150 terabytes (TB) each month, with peak daily intakes of about 5 TB.
Veracity: this refers to the reliability of the data. Data may be biased or noisy, or may contain outliers, making it difficult to analyse. The sources of these problems, such as a faulty instrument gathering data, may be discovered and corrected, or they may remain in the data undiscovered. At CEDA, despite the pressure on storage, datasets are not deleted, partly because a record is needed if an ‘incorrect’ dataset was used in research, and partly because it would take a great deal of effort to decide which to delete.
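To get a feel for how the volume and velocity figures above relate to each other, the short Python sketch below converts the Sentinel intake into a rough daily average and a yearly total, and compares that total with the current archive size. It is only an illustration: the 30-day month and the decimal unit convention (1 PB = 1000 TB) are assumptions, and the variable names are made up for this example.

```python
# Rough arithmetic on the figures quoted in the text.
# Assumptions (not from the source): decimal units and a 30-day month.
TB_PER_PB = 1000       # assumed: 1 petabyte = 1000 terabytes
DAYS_PER_MONTH = 30    # assumed: rough average month length

archive_size_pb = 17          # current CEDA storage, from the text
sentinel_tb_per_month = 150   # "over 150 TB each month", so a lower bound

# Average daily intake implied by the monthly figure
avg_daily_tb = sentinel_tb_per_month / DAYS_PER_MONTH

# Yearly intake, converted from terabytes to petabytes
yearly_pb = sentinel_tb_per_month * 12 / TB_PER_PB

# Yearly intake as a fraction of the current archive size
share_of_archive = yearly_pb / archive_size_pb

print(f"Average daily intake: {avg_daily_tb:.1f} TB")
print(f"Yearly intake: {yearly_pb:.1f} PB")
print(f"Share of current archive added per year: {share_of_archive:.0%}")
```

Under these assumptions, this one dataset alone would add about 1.8 PB a year, roughly a tenth of the current 17 PB, which helps explain the interest in cheaper storage such as tape.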
© University of Reading and Institute for Environmental Analytics