Big data storage: CEDA

Big data often needs to be stored and processed using a dedicated storage facility. Watch Dr Victoria Bennett, Head of CEDA, explain more.
My name is Victoria Bennett and I work at the Centre for Environmental Data Analysis, or CEDA. So within CEDA we run data archives, so we store large amounts of data for the science community to save them having to get hold of it all themselves. But alongside that, we run an analysis environment called JASMIN. The JASMIN infrastructure offers a cloud environment for our users. And we also offer software as a service, where people can access our data through web interfaces and software tools. So it allows people to do their processing and their analysis right next to the data. Today we’ve got around 17 petabytes of storage and over 5,000 processing cores. It’s really heavily used. The storage is nearly full.
The processing is almost at 100% almost all of the time. So we’re providing a service that obviously is very much needed. And we hope to grow it as the demand continues to increase. For me, the most interesting thing with our big data is just the variety of uses for which people come to us. So it really is everything in science, from archaeology to bird habitats, looking at hurricanes and storm tracks, understanding ocean currents, just such a wide range of things. And they bring in data sets from all sorts of places and do their combinations and do great science with all the stuff that we offer. So it’s really interesting, the diversity of uses.
The biggest data sets that we hold in CEDA at the moment come from satellites. So the Copernicus Sentinel series of satellites, they’re producing huge amounts of data. And we’re taking all of it and archiving it for our users. And it’s around 150 terabytes per month right now, and increasing all the time as more satellites are going up. The amount of data that’s added each day does depend on what we’re working on at any one time and with particular projects we’re supporting. I’d guess around 5 terabytes a day, at the moment. But sometimes we’ll peak higher than that if we need to bring in a huge amount of data very quickly.
And sometimes there might be a bit of a lull in activity, but it’s probably that order of magnitude. We rarely delete data off our archives. There are a few reasons for that. For the scientific record, it’s important that we don’t get rid of data that’s being used in order to draw conclusions about what’s happening to the environment. It’s really important that scientific conclusions are traceable back to the evidence on which they’re based. The other thing is that actually going back and evaluating which data sets are worth keeping and which ones are worth deleting costs more, in terms of time and money, than just continuing to store them. Almost all of our data is backed up.
If we’re not the only place in the world where it’s kept, sometimes we don’t bother keeping a backup, because people can go back to the original source to get that data. So if we’re storing data that’s just a copy of data elsewhere and we find a problem with it, a corruption or missing bits of files, then we might get rid of it altogether. But normally we keep everything, because it’s more efficient for us and the users might grumble if we get rid of something that they’re using.
The challenge that we have at the moment is that the data rates are increasing so fast, the data volumes are growing so quickly, that we, and probably this is the same in many organisations around the world, are finding it’s becoming unsustainably expensive to store everything on this technology that we have right now. It is very likely that in the next year we’ll be moving to a new storage technology. And that will involve quite a lot of changes and quite a lot of research. And the team are really excited about doing things in a different way, because that gives them a good problem to get their teeth into and find solutions to.
The Centre for Environmental Data Analysis (CEDA) has been working with big data for 22 years – since before the phrase ‘big data’ was coined. Watch Dr Victoria Bennett, Head of CEDA, explain how storing and providing access to data can support environmental science.
CEDA is focused on atmospheric and Earth Observation data from satellites, climate model simulations, meteorological observations, aircraft and ground-based measurements, as well as datasets produced by scientists using these raw data as inputs.
Data is added to CEDA in different ways – even arriving in the post on a disk or as an email attachment. Some data are routinely pushed to the organisation by data providers; others are pulled in via a satellite receiver dish, or over the internet using FTP (the File Transfer Protocol). CEDA also has a facility for data producers to upload their data, which are checked before being added to the archives.
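The article does not describe CEDA’s ingest checks in detail, but the general idea of verifying uploaded data before archiving can be sketched with checksums. A minimal illustration, assuming the producer supplies a SHA-256 digest alongside each file – the hash choice and the function names (`sha256_of`, `verify_before_archiving`) are hypothetical, not CEDA’s actual pipeline:

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()

def verify_before_archiving(data: bytes, expected_digest: str) -> bool:
    """Accept a dataset into the archive only if its checksum matches
    the digest advertised by the data producer."""
    return sha256_of(data) == expected_digest

# Example: a producer uploads a file together with its advertised digest.
payload = b"surface_temperature,2018-01-01,281.4\n"
advertised = sha256_of(payload)  # in practice, supplied by the producer

print(verify_before_archiving(payload, advertised))  # True: file intact
print(verify_before_archiving(payload, "0" * 64))    # False: digest mismatch
```

A check like this catches the “corruption or missing bits of files” mentioned in the video before a damaged copy ever enters the archive.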
CEDA not only holds data for users but also provides an analysis environment. Adjacent computers allow users to analyse, process and investigate the data, without moving them around networks, and without needing to store them on their own computers. This avoids multiple copies of the same data in different institutes, saving both time and money.
As Victoria Bennett mentions in the video, JASMIN is a petabyte-scale data storage and analysis infrastructure run for users to exploit the data held by CEDA. There are currently 17 petabytes of storage, over 5,000 processing cores and fast networking, so users can efficiently access, analyse and process the data. JASMIN offers several types of computing: a private cloud and hosted processing for expert users (infrastructure as a service and platform as a service), plus software as a service – web interfaces and services for accessing the data.
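As a rough sanity check, the figures quoted in the video and the article are consistent with each other: roughly 150 terabytes a month works out at about 5 terabytes a day, so ingest adds about a petabyte every 200 days. A back-of-envelope sketch, assuming binary prefixes (1 PB = 1024 TB), which the article does not specify:

```python
# Figures quoted in the article: ~17 PB of storage, ~5 TB of new data per day.
PETABYTE_TB = 1024  # assuming binary prefixes; the article does not say

current_storage_tb = 17 * PETABYTE_TB  # quoted capacity
daily_ingest_tb = 150 / 30             # ~150 TB/month, i.e. ~5 TB/day

days_per_petabyte = PETABYTE_TB / daily_ingest_tb
print(f"~{days_per_petabyte:.0f} days to ingest one more petabyte")
# prints "~205 days to ingest one more petabyte"
```

At that rate a nearly full 17-petabyte archive has very little headroom, which is consistent with the video’s point that storing everything on current technology is becoming unsustainably expensive.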
Over 1,000 users regularly log in to JASMIN to perform analyses of the data remotely, while over 30,000 users regularly access the data through the data catalogue. This includes research scientists, but also users from other sectors (eg government and industry), depending on the data’s licences.
JASMIN users work on a huge range of science projects, including earthquake monitoring, simulating hurricanes, measuring greenhouse gas emissions, analysing high resolution global and regional climate models, understanding ocean currents and modelling air pollution.
This combination of storage, processing and networking offers a range of options and benefits for different types of user needs. Jon Blower will discuss these later this week.
This article is from the free online course Big Data and the Environment, created by FutureLearn.
