Jon Blower

Chief Technology Officer at the Institute for Environmental Analytics. We help all kinds of people to understand and make the most of environmental data!

Location: Reading, United Kingdom

Activity

  • Yes, the common use of the term "API" has shifted a bit recently. Quite often when people say "API" these days, they mean a means to access data over the web, in a machine-readable way. These kinds of APIs are very important for accessing and integrating different kinds of data. (Commonly data may be supplied in a standard format such as JSON.)

    In order to...
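The kind of machine-readable access described above can be sketched in a few lines. This is a minimal illustration of consuming a JSON response, using only the standard library; the payload and field names here are invented for illustration, not from any real API.

```python
import json

# A hypothetical JSON payload, of the kind a web API might return.
# The field names and values are invented for illustration.
response_text = '{"station": "Reading", "temperature_c": 14.2, "timestamp": "2018-03-01T12:00:00Z"}'

# Parse the JSON text into a Python dictionary for further processing.
record = json.loads(response_text)
print(record["station"], record["temperature_c"])  # → Reading 14.2
```

In practice the text would arrive from an HTTP request rather than a string literal, but the parsing step is the same.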

  • Good spot, I will raise this!

  • Yes, both good points. Transparency is a big motivation for Open Data. And people do worry a lot about misinterpretation, and this has often been a justification for not publishing data. My own opinion is that, for most public datasets, the benefits of transparency and trust outweigh the worries over misinterpretation, which I think are often exaggerated. (If...

  • Welcome to the course! The use of Big Data to guide policymaking is extremely important. In the UK we have a 25-year Environment Plan, which will shape this area (https://www.gov.uk/government/publications/25-year-environment-plan). We'd love to use our data and expertise to contribute towards the plan, and it's a big challenge to work out how best to do this...

  • Absolutely right!

  • I agree too! The importance of storytelling is much overlooked. People generally respond well to narrative explanations.

  • Great discussion. To add to this, there has recently been a move towards providing satellite data in "analysis-ready" forms. The idea is that some of the complexity of the data is removed, by doing necessary corrections and regularisation of the data up-front, thereby making it easier for a wider community to access. If you do a web search for "Analysis Ready...

  • @NiallGeraghty definitely, data ought to be provided in machine-readable forms (not PDFs - agh!). But when we are considering data volumes in the terabyte and petabyte scale, we can't give it all to the user. There's a big challenge in working out how much is "enough" for different user types.

  • Great point. It's certainly true that data are sometimes technically "open" but in practice are still very difficult (and hence expensive in terms of time) to use. Lots of people are working on this, but I personally think that "usability" is an aspect of open data that hasn't been looked at enough.

  • Nice example - I bet many areas of the sports industry are ahead of most other sectors in terms of data analytics.

  • Hi Jim, this is a really great point. At the IEA we are working on a renewable energy planning platform aimed at developing island states. Slow internet speeds and lack of access is certainly a problem. (It can also still be a problem in the "first world", in rural or inaccessible areas.) This is where cloud computing can play a role, to do all the storing and...

  • Wonderful to see so many learners from so many different countries! I hope you all enjoy the course.

  • Nice set of applications here!

  • Yes. There are lots of examples of using satellite data in non-commercial (and commercial) settings for environmental improvement. Satellite imagery remains the best way of monitoring large-scale deforestation, for example. It can be used to provide evidence to ensure that landowners and farmers adhere to policies (e.g. on allowing land to lie fallow and...

  • You make a very valid point about the possible drawback. It's interesting that people react to computer-based systems like this in different ways. The feedback we've had on this system has been very positive overall - I think most people like to see this kind of thing more than reading the same information in lengthy reports!

  • Great suggestions everyone!

  • Jacqueline - this is very nicely put. I think you have grasped the main point, which is that visualisation is not just about "pretty pictures" (dissemination) but also about aiding discovery.

  • I think it depends on how you use it. It may be easier to think of visualisation as an *act* rather than a *thing*. For example "the visualisation task I am doing now is observational in nature", rather than "this thing is an observational visualisation". Does that make sense?

    My answer to "Neil AT" above might help answer this too.

  • The way I see it personally is this - if you're just making some kind of a plot, without trying to answer a particular question, this is "observational visualisation". To me, it just means "taking a look at the data and getting a feel for it". What is the resolution? Where are the gaps? Is there clear structure in the data or is it "noisy"? It's often the...

  • To give an example: imagine that I look at my thermometer in my garden and it says the temperature is 50 degrees Celsius. Do I believe it? Probably not, because my (mental) model of the world tells me that the temperature is never this high in March in the UK. My model is correcting my observation. In reality, we combine many sources of information with models...
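The thermometer example above can be sketched as a simple plausibility check: a "model" (here just a plausible temperature range, with bounds invented for illustration) that flags observations it considers unbelievable.

```python
# A toy plausibility check: our very simple "model" of UK temperatures
# says readings outside this range are implausible.
# The bounds are invented for illustration.
PLAUSIBLE_RANGE_C = (-15.0, 30.0)

def is_plausible(temp_c, bounds=PLAUSIBLE_RANGE_C):
    """Return True if the reading falls within the model's expected range."""
    low, high = bounds
    return low <= temp_c <= high

print(is_plausible(14.0))  # a typical March reading → True
print(is_plausible(50.0))  # the suspect thermometer reading → False
```

Real quality-control systems are far more sophisticated, but they rest on the same idea: a model of what is believable is used to question the observation.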

  • And to complicate things further, don't forget that data (observations) also have errors. Introducing a (good) model can actually reduce the errors. And you need some kind of a model (simple or complex) to give you an estimate of what is going on *between* the observations, i.e. where you have no data. A key point is to understand how much *information* there...
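The simplest possible model for estimating "between" the observations is a straight line joining two measurements. This is only a sketch, with invented numbers, but it shows how even a trivial model fills the gaps where we have no data.

```python
def linear_interpolate(x0, y0, x1, y1, x):
    """Estimate a value at x using a straight-line model
    between two observations (x0, y0) and (x1, y1)."""
    frac = (x - x0) / (x1 - x0)
    return y0 + frac * (y1 - y0)

# Observations (invented): 10 °C at 09:00 (t=9) and 16 °C at 12:00 (t=12).
# The linear model gives an estimate where we have no data, e.g. at 10:30.
print(linear_interpolate(9, 10.0, 12, 16.0, 10.5))  # → 13.0
```

A more complex model (a weather model, say) would give a better estimate, with smaller errors, but it is doing the same job in principle.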

  • Really interesting points. There are so many types of model it's hard to know where to start. A model is just a view of the way something works - a relationship, or set of relationships, between things we observe. It might be something really simple (like a linear relationship between two things) or something extremely complicated (like a weather forecast,...

  • Great point. I think the increased provision of "Analysis Ready Data" is a big plus point for users. Of course, some people will still want to go back to the raw data if they have very specialist applications, but ARD opens data up to a much wider audience.

  • Absolutely - and I think this is really the core of "data science".

  • Any kind of data storage is a trade-off between the cost of storage and the speed of retrieval. Tape is very cheap for storage but slow to retrieve - but it may be the only viable option for very large archives that we don't need to access very often. (Tapes are much better than they used to be though!)

  • Great points - data storage and processing are not free, either in terms of cost or environmental impact.

  • "Veracity" means reliability, in the sense of accuracy. "Velocity" means how fast the data are collected or transferred.

  • Yes, there is a big issue of communication and social science (how people react to evidence) as well as generating the evidence from data.

  • Well I can't speak about economic models, but we *can* predict the weather, albeit imperfectly. We can show that average weather forecast accuracy (known as "skill") is increasing with developments in data acquisition and modelling. You are absolutely right that it is very hard to get a grip on the uncertainties of complex models, and this is a real challenge....

  • Yes - increasingly commercial vessels have sensors for lots of things, including weather and current. And scientists have attached other instruments to ferries and other vessels to measure many things (these are called "ships of opportunity").

  • Great point. Some things (like air pollution and noise) are very challenging to measure because they vary so much on small spatial scales. A measurement in one place could be very different from a measurement only a few metres away.

  • It's a great point that just because we have *lots of* data doesn't mean we always have the *right* data for a problem.

  • It's hard to compare the capability of a big computer with that of the human brain because they don't work in quite the same way. Your brain can do plenty of things that even the most powerful computers can't (yet)! One example is recognising familiar faces, which humans do better than computers. (Unfamiliar faces are a different story...)

  • The weather forecast *is* getting more accurate, but the weather is a chaotic system that can never be predicted with 100% accuracy. A crucial part of getting the weather forecast right is understanding what is happening *now*. This is where data comes in. We use data from satellites, weather stations and other sources to get the best picture of the current...

  • Same here - I hadn't heard of it but will check it out!

  • Hi Cristina - we do not have observations of everything we would like, unfortunately. If we had a higher density of sensors we could certainly do a lot more. There is a lot of interest at the moment in deploying sensor networks in different locations, particularly cities ("Internet of Things" is the key buzzphrase!). There is a cost to this of course, both in...

  • Hi Cristina - the problem is one of scale. Fog can be caused by highly local conditions, which are not always detectable by the information we have on weather. If we had extremely fine-grained weather information it would be easier to use this approach.

  • I hadn't heard of brontobytes or geopbytes! We'll have to update the infographic...

  • Yes, that's right, which is why it's often not worth the effort to delete it. (Because you would need to develop and implement procedures to decide what to keep.) But it's worth noting that sometimes we *should* delete old data, e.g. if it is personal in nature and no longer relevant (because of Data Protection).

  • That's a very interesting case. Last week I came across a new startup company in Portugal (BitCLIQ), who are using blockchain technology to digitally assure the traceability of seafood products. I'm sure the same ideas could be applied to the supply of any goods, including aircraft parts.

  • It's worth noting that you need a licence even if the data are made available for free to everyone (e.g. a Creative Commons licence). A licence does not necessarily mean that money changes hands. However, sometimes the choice of licence is governed (or informed) by the original conditions of the funding that led to the research.

  • Vanessa - interesting idea. One issue in research is that the value of the data may not be known for a very long time after collection. And we already have to make some tough decisions about how much data to keep. But I do like the idea that there could be a model in which costs are deferred until the value is better known. It's a bit like getting a loan...

  • Nuno - I saw an interesting presentation from a data centre a while ago, where they gave another reason why they keep everything. If the archive is growing exponentially, then there is little point in deleting old data, as it represents such a small fraction of the archive. It's easier just to keep everything!

  • Great example - I really like the way it starts with a "story", but then lets you explore the data in your own way.

  • Firstly, congratulations on getting to grips with Python/R/etc! It's not an easy thing to move from something like Excel to a programming language, but it gives a massive increase in power and flexibility. Regarding data size, having data that are "too big to download" is a useful working definition of a Big Data problem! If the data provider doesn't provide...

  • Absolutely - "domain knowledge" is very important. Sometimes this means you need assistance from someone else, who knows the domain best.

  • All good points, and point 2 is particularly useful - there can be a lot of work in "cleaning" and filtering datasets to get a workable starting point.

  • Yes, it is definitely a team game. No single person is going to have all the skills and experience needed. Skills in consultancy, requirements analysis, project management, maths, stats, scripting, software engineering, graphic design and many more all have to come together on a successful project, all held together by good communication skills.

  • I've noticed that the usage of the term "API" has shifted. Before the Web was pervasive, "API" was used to mean a set of high-level functions in a software library, which can be assembled to build an application. Nowadays we tend to use "API" to mean a means to access (and change) data programmatically over the web.

  • I don't endorse any particular website, but I'm aware of https://datascientistjobs.co.uk. Jobs.ac.uk will also frequently have relevant jobs (in the academic sector). Other places I would look include New Scientist jobs and LinkedIn.

  • Very interesting point, thanks. In many respects, lots of people do "data science" in a variety of contexts and roles, without explicitly calling it by that name. I guess the coining of the term "data scientist" reflects the generally increasing need across many industries for decisions to be informed by data.

  • An excellent question with no easy answer other than very careful work!

  • Those skills are certainly extremely important and we value them highly at the IEA. It depends on whether you're going to be "hands on" with the data - if you are, it's difficult to get away from the need for good computing skills (and maths/stats). Domain knowledge is also very valuable, to understand how the data should best be used.

  • Yes, and there are quite a few fully open journals you can choose from these days (depending on your field of course). Open Access makes your work fully available to everyone, but the author usually has to pay to enable it. There's an interesting question about whether this model effectively excludes authors from institutions that...

  • Personally I don't think the terms are very well defined, and often mean different things to different people. I would not take any particular definition too seriously. Perhaps a professional body should step in and try to formulate some proper definitions.

  • There's a very interesting question around who should be responsible for installing new sensing infrastructure. A single use case may not be enough to justify the cost (which, as David says, includes both the hardware and the data storage), but if multiple use cases can be found, the business case can possibly be made. Should sensor networks therefore be...

  • The IEA website and Twitter feed will no doubt publish updates as they appear, if we're able to make progress in this area. (I can't take credit personally for this piece of lateral thinking by the way!) You may also be interested in the studies by Overeem and colleagues into the use of mobile phone signal strength data to calculate rainfall patterns (another...

  • Great point - the context around the data is often not preserved and this can be really important for future reuse. Quite a few initiatives are looking at this problem, e.g. from the point of view of linking datasets to publications, user feedback etc, to help future users.

  • There's more about this in Week 2, "The Role of the Data Scientist". I hope that will be interesting for you! People arrive in this kind of career from all kinds of directions.

  • That's wonderful. It was certainly one of our aims in creating this course to show that it's possible to follow new and exciting careers emerging from the environment and data science.

  • Yes, this is an important point about the use of data in general. We like to think of data as being multi-purpose and entirely objective, but in reality a certain amount of domain knowledge is usually required to interpret them correctly. Measures of statistical confidence are one way to address this, but we may also often have knowledge of the fundamental...

  • Great point that peer review is not generally sufficient to guarantee data quality. These days there are a few "data journals" that specialise in data publication. Even then it must be very hard for reviewers to sift through large datasets, but it is certainly possible to check that a dataset is well-enough described.

  • Yes. There is a lot of work going on in this area, and the issue of using widely-understood licences is part of this. We have suffered in the past (and still do) from unclear data licensing, where the licence attached to a dataset is either not present or not easily understood. I'm not an expert in this area, but I understand that there is increasing...

  • Yes, and this comes back to the point on ethics that was made in the comments on one of the other activities. Users should indeed be informed about how their data will be used.

  • Yes, and I believe this is usually the case - that "sensitive" data are not made open. National critical infrastructure is one example.

  • Good questions Matt. I should clarify that the project was a very short proof-of-concept, and much more validation would need to be done with more data. But early results were promising and we hope to follow this line of enquiry further.

  • Yes, and "big data" concerns can still be relevant even at the scale of a laptop. If the data you work with are larger than your available RAM, you have to think carefully about how to process it. You might have to employ algorithms that are different from the ones you are used to. If you are a Python programmer, you could look at the Dask library, for instance.
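The chunking idea that libraries like Dask automate (and parallelise) can be sketched in plain Python: process the data one memory-sized batch at a time, so the whole dataset never needs to fit in RAM. The data values here are invented for illustration.

```python
def running_mean(chunks):
    """Compute the mean of a dataset one chunk at a time,
    so the whole dataset never needs to fit in memory.
    (This is the idea that libraries like Dask automate.)"""
    total = 0.0
    count = 0
    for chunk in chunks:  # each chunk is a small, memory-sized batch
        total += sum(chunk)
        count += len(chunk)
    return total / count

# Simulate a large dataset arriving as chunks (values invented).
data_chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
print(running_mean(data_chunks))  # → 3.5
```

In real use the chunks would be read from disk or over the network; the point is that the algorithm only ever touches one chunk at a time.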

  • Yes, agreed, and this is also a fast-moving area.

  • Lots of terrific examples here, thanks for all your contributions!

  • Great quote! I hadn't seen that one before but it's going on my list!

  • Yes, this is a really interesting capability. You can find more about the technique from Dr Aart Overeem's webpage, and the literature linked from it: http://www.wur.nl/en/show/Rainfall-measurement-in-the-city.htm. Also there is this article, also by Overeem et al: http://www.pnas.org/content/110/8/2741.full.

  • Great to see so many different people from so many different countries and backgrounds! I hope you all enjoy the course and can learn a lot from each other as well as from the activities.

  • Yes, agreed. Good visualisations are part of this, as we will discuss later in this course. It's a huge challenge to present (potentially) highly technical data to decision-makers in a way that is meaningful.

  • Yes, this is a very good point. Certainly such training will be very important and more "big data" content is appearing in University courses and postgraduate training. It's a big challenge to design appropriate training programmes in such a fast-moving field.

  • Great point - look out for Dr Friederike Otto in week 3, who will talk about climateprediction.net, in which home computers are used to run climate simulations.

  • Yes, very good point, thanks!