Hello. My name is Ben Lloyd-Hughes and I’m a Data Scientist at the Institute for Environmental Analytics. Today I’m going to take you through some of the key elements of data science.
Phase 1, the problem statement. The first step in any data-driven project is a clear statement of the problem at hand. What is it that we are trying to achieve? How will we know when we get there? How will we measure our progress along the way? It’s crucial to set these out ahead of time and get them agreed by all those involved in the project. The answers will serve both as your compass, guiding you towards the goal, and your parachute should things go wrong, which they will.
They provide an evidential basis to explain what hasn’t worked, and point to what might work in the future or what needs to be done before trying again.
Phase 2, data acquisition. The next phase is data acquisition. Data is your raw material, so you’ll be needing as much of this as you can lay your hands on. Here I mean everything, from traditional datasets such as spreadsheets and reports, to the academic literature, to good old-fashioned notebooks. Get out there and talk to as many people as you can who understand the problem and the insights that might be gained from the data. These conversations will greatly assist you in understanding the data, but more of that later on. Be dogged.
Don’t be afraid to chase people up on their promises to deliver. In this era of cheap computational storage, there is no such thing as too much data.
Phase 3, data preparation. In the ideal world, data arrives via a well-documented API, complete with comprehensive metadata describing the provenance, quality and previous use cases. It will provide details of costs, licensing and everything we need to begin using the data immediately. Sadly this is almost never the case, which means it will be incumbent upon you to wrangle the data into shape. Rather than viewing this as a chore, I tend to view it as an opportunity to get to know your data better.
Phase 4, data exploration.
Once I have a particular dataset in a computer-readable form, I get to work visualising the data. I favour hard-copy, large-format plots that can be festooned around the office. There’s only so much you can see on a screen at once, and having big pictures around the place makes for an excellent talking point, which often serves to elicit ideas from your colleagues and visitors to the office. A visual inspection is great for building our understanding of the data. It will highlight gaps, trends, outliers, discontinuities and all manner of interesting features. Now is the time to begin annotating the data, the links between particular sources and the relationships with the wider world.
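Alongside the plots, a few lines of code can flag some of the same features numerically. The following is a minimal sketch in Python, not a method from the talk: the function names `find_gaps` and `find_outliers` are hypothetical, the sample readings are made up for illustration, and the outlier rule (a modified z-score based on the median absolute deviation, with the conventional cut-off of 3.5) is just one common assumption among many.

```python
from statistics import median


def find_gaps(times, expected_step):
    """Return (before, after) pairs of consecutive timestamps that are
    further apart than expected_step. Assumes times are already sorted."""
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > expected_step]


def find_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds threshold.

    The score is 0.6745 * |x - median| / MAD, where MAD is the median
    absolute deviation. The 3.5 cut-off is a convention, not a rule.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median; the score is undefined.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Illustrative (made-up) hourly readings with a missing stretch and a spike.
hours = [0, 1, 2, 5, 6]
readings = [10, 11, 9, 10, 12, 10, 95]

print(find_gaps(hours, expected_step=1))  # [(2, 5)]
print(find_outliers(readings))            # [95]
```

Checks like these complement the big printed plots rather than replace them: they point to where to look, while the picture helps explain why.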
The human brain is superb at picking out patterns. This is usually a good thing, but sometimes it’s not. Still, these initial visualisations frequently form the seeds which will grow into ideas, which can be tested formally further down the line.
Phase 5, model building. Here we explore the utility of the links identified during the initial exploration phase. Utility needs to be clearly defined. This is a key concept and it should have been agreed upon early in the project, during the problem definition phase. The types of model available are largely determined by the nature of the relationships, be they one-to-one or many-to-one, linear or non-linear.
The crucial thing to keep in mind is to maintain a clear separation between the data used to train and test the models and the data used to validate the final solution. This statement is very easy to make but much harder to maintain in practice, especially when the data is scarce. It’s critical that you do not deceive yourself.
Phase 6, documentation. A project is not complete until it’s written up. Documentation needs to be an integral part of the process. Every step needs to be version controlled along with your code. If you find yourself scrabbling under pressure at the end of the project then there might be a different way of doing it.
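Returning for a moment to the Phase 5 point about separating data, one way to keep the validation set out of reach is to carve it off once, before any modelling begins. What follows is a minimal sketch, assuming a simple random split is appropriate (it often is not for time series or grouped data). The function name `three_way_split` and the fractions are hypothetical choices, not something prescribed in the talk.

```python
import random


def three_way_split(records, test_frac=0.2, validation_frac=0.2, seed=0):
    """Shuffle records once and return (train, test, validation) lists.

    The validation portion is set aside first so that it is only used
    to check the final solution, never to fit or tune a model.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)

    n = len(shuffled)
    n_validation = int(n * validation_frac)
    n_test = int(n * test_frac)

    validation = shuffled[:n_validation]
    test = shuffled[n_validation:n_validation + n_test]
    train = shuffled[n_validation + n_test:]
    return train, test, validation


# Illustrative usage with 100 made-up record identifiers.
train, test, validation = three_way_split(range(100))
print(len(train), len(test), len(validation))  # 60 20 20
```

Fixing the seed and recording it alongside the code means the same split can be recreated later, which ties in with the version control point above.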
I’m not proposing to reinvent the wheel here. I’m simply going to direct you to the excellent book ‘Guerrilla Analytics’ by Enda Ridge. Depending on the definition of the project deliverables, it may or may not be necessary to publish the data and the final model via an API. My personal view is that this is well worth the effort, especially if there is any chance the initial project will grow into something larger. Remember our vision of an idealised data source? Well, this is where we can help ourselves.
Phase 7, publicity material.
Never underestimate the importance of pretty graphics. These are crucial for presenting our results to the project stakeholders, and they form the basis of the marketing collateral that may well secure our next gig. It’s always easier to create this material at the time you are creating the initial graphics. And it’s always worth the effort to make things beautiful: well designed, with well-thought-out colour schemes, and scalable to any size.