What does a data scientist do?

Dr Ben Lloyd-Hughes, a data scientist, explains how he would approach a project using big data.
Hello. My name is Ben Lloyd-Hughes and I’m a Data Scientist at the Institute for Environmental Analytics. Today I’m going to take you through some of the key elements of data science.
Phase 1, the problem statement. The first step in any data-driven project is a clear statement of the problem at hand. What is it that we are trying to achieve? How will we know when we get there? How will we measure our progress along the way? It’s crucial to set these out ahead of time and get them agreed by all those involved in the project. The answers will serve both as your compass, guiding you towards the goal, and your parachute should things go wrong, which they will. They provide an evidential basis to explain what hasn’t worked, and point us to what we think might work in the future or what needs to be done before trying again.
Phase 2, data acquisition. The next phase is data acquisition. Data is your raw material, so you’ll be needing as much of this as you can lay your hands on. Here I mean everything, from traditional datasets such as spreadsheets and reports, to the academic literature, to good old-fashioned notebooks. Get out there and talk to as many people as you can who understand the problem and the insights that might be gained from the data. These conversations will greatly assist you in understanding the data, but more of that later on. Be dogged. Don’t be afraid to chase people up on their promises to deliver. In this era of cheap computational storage, there is no such thing as too much data.
Phase 3, data preparation. In the ideal world, data arrives via a well-documented API, complete with comprehensive metadata describing the provenance, quality, and previous use cases. It will provide details of costs, licensing and everything we need to begin using the data immediately. Sadly this is almost never the case, which means it will be incumbent upon you to wrangle the data into shape. Rather than viewing this as a chore, I tend to view it as an opportunity to get to know your data better.
Phase 4, data exploration. Once I have a particular dataset in a computer-readable form, I get to work visualising the data. I favour hard-copy, large-format plots that can be festooned around the office. There’s only so much you can see on a screen at once, and having big pictures around the place makes for an excellent talking point, which often serves to elicit ideas from your colleagues and visitors to the office. A visual inspection is great for building our understanding of the data. It will highlight gaps, trends, outliers, discontinuities and all manner of interesting features. Now is the time to begin annotating the data, the links between particular sources and the relationships with the wider world. The human brain is superb at picking out patterns; this is usually a good thing, but sometimes it’s not. But these initial visualisations frequently form the seeds which will grow into ideas, which can be tested formally further down the line.
Phase 5, model building. Here we explore the utility of the links identified during the initial exploration phase. Utility needs to be clearly defined; this is a key concept and it should have been agreed upon early in the project, during the problem definition phase. The types of model available are largely determined by the nature of the relationships, be they one-to-one, many-to-one, linear or non-linear. The crucial thing to keep in mind is to maintain a clear separation between the data used to train and test the models and the data used to validate the final solution. This statement is very easy to make but much harder to maintain in practice, especially when the data is scarce. It’s critical that you do not deceive yourself.
Phase 6, documentation. A project is not complete until it’s written up. Documentation needs to be an integral part of the process. Every step needs to be version controlled along with your code. If you find yourself scrabbling under pressure at the end of the project then there might be a different way of doing it. I’m not proposing to reinvent the wheel here, I’m simply going to direct you to the excellent book ‘Guerrilla Analytics’ by Enda Ridge. Dependent upon the definition of the project deliverables, it may or may not be necessary to publish the data and the final model via an API. My personal view is that this is well worth the effort, especially if there is any chance the initial project will grow into something larger. Remember our vision of an idealised data source? Well, this is where we can help ourselves.
Phase 7, publicity material. Never underestimate the importance of pretty graphics. These are crucial for the presentation of our results to the project stakeholders, and they form the basis of the marketing collateral that may well secure our next gig. It’s always easier to create this material at the time you are creating the initial graphics. And it’s always worth the effort to make things beautiful: well designed, with well-thought-out colour schemes, and scalable to any size.
Listen to IEA Data Scientist, Dr Ben Lloyd-Hughes, on the seven phases of a data science project. These are:
1. Problem Statement
2. Data Acquisition
3. Data Preparation
4. Data Exploration
5. Model Building
6. Documentation
7. Publicity Material
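The separation that Model Building calls for, between the data used to train and test models and the data used to validate the final solution, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_holdout`, the 20% validation fraction, the fixed seed and the numbered stand-in observations are all hypothetical and do not come from the talk.

```python
import random


def split_holdout(records, validation_fraction=0.2, seed=42):
    """Split records into (development, validation) lists.

    The validation set is set aside once, up front, and is meant to be
    used only to check the final solution. All model training and
    testing happens on the development set.
    """
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(records)               # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    validation = shuffled[:n_validation]
    development = shuffled[n_validation:]
    return development, validation


# Hypothetical example: 100 numbered observations standing in for real data.
observations = list(range(100))
development, validation = split_holdout(observations)

# No observation may appear on both sides of the split.
assert not set(development) & set(validation)
print(len(development), len(validation))  # 80 20
```

A random split like this suits independent records. For time-ordered data, holding out the most recent period is a common alternative, because neighbouring observations can otherwise leak information across the split.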
An example of a project Ben has worked on is WeFarm Weather & Climate Message Analysis. Ben collaborated with WeFarm, an SMS text platform for small-scale farmers in developing countries, to analyse the content of texts between farmers that referenced weather and climate. The aim was to establish whether there was a need to extend the service to provide specific advice.
Simple visualisations such as word clouds illustrated the most common topics in around 160,000 messages sent between January 2014 and February 2016. A detailed semantic analysis, which had to allow for highly irregular grammar and spelling, revealed a clear demand for climate service information.
A word cloud which illustrated the most common topics from WeFarm. The largest words are ‘drought’ and ‘resistant’.
The analysis further indicated that there is potential value in an educational service that sends geographically targeted advice messages about potential climate impacts when applicable, and it highlighted the opportunity for a ‘recommend-a-crop’ service.
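The counting that sits behind a word cloud can be sketched as a simple word-frequency tally. This is a minimal illustration, not the method used in the project: the sample messages are invented (they are not WeFarm data), the stop-word list and the function name `word_frequencies` are assumptions, and the sketch makes no attempt at the handling of irregular grammar and spelling that the real analysis needed.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "and", "are", "for", "in", "is", "the", "to", "what", "which"}


def word_frequencies(messages):
    """Count words across messages, ignoring case, punctuation and stop words."""
    counts = Counter()
    for message in messages:
        words = re.findall(r"[a-z]+", message.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


# Invented messages for illustration only; NOT real WeFarm data.
sample_messages = [
    "Which maize is drought resistant?",
    "What crops are resistant to DROUGHT",
    "drought in the north, which beans to plant",
]

# Print the two most frequent words with their counts.
for word, count in word_frequencies(sample_messages).most_common(2):
    print(word, count)
# drought 3
# resistant 2
```

In a word cloud, each word would then be drawn at a size proportional to its count.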
This article is from the free online course Big Data and the Environment.

Created by
FutureLearn - Learning For Life