Data Science Workflow
Preliminaries
Preliminary steps are typically performed prior to the involvement of data scientists.
1. Decide on the Problem
The problem is specified by the data scientist’s ‘client’. This may, of course, be a group or individual in the same organisation as the data scientist (or even the data scientist themselves if their role involves more than pure data science), but they are the people who wish to make use of the output of the data scientist’s work. This output is a statistical model that will be able to answer questions about the system being modelled. We look at statistical models abstractly in the Statistical Models, Loss Functions and Training as Optimization step and concretely throughout the last three weeks of the course, and we look at a number of different types of learning in the Types of Learning step, which gives an indication of the types of questions such models can answer.
The Open University online course,
Advanced Machine Learning
2. Acquire Data
Once the problem is decided upon, it is time to acquire data – even if that means simply working out which data is relevant in pre-existing data storage. At this point, the more data the better, and you should encourage the client to obtain as much data of as many different types and variables as possible. Data storage is cheap, and it is better to have data that you never use than to need data you do not have.
Types of Data
Structured Data
Structured data is the form that most people think of when they think of data. It consists of a number of cases, each giving values for a number of variables, and can be thought of as a table. The variables are (typically) the columns and the cases the rows. Different areas are more or less likely to have immediate access to structured data. When training and using statistical models, you will (generally) use structured data.
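As a minimal sketch of this idea, a structured data set can be held as rows (cases) and columns (variables). The variable names and values here are invented purely for illustration:

```python
# A tiny structured data set: each row is a case, each key a variable.
# Variable names and values are made up for illustration only.
cases = [
    {"age": 34, "income": 52000, "owns_home": True},
    {"age": 27, "income": 31000, "owns_home": False},
    {"age": 45, "income": 78000, "owns_home": True},
]

variables = list(cases[0].keys())   # the "columns" of the table
print(variables)                    # ['age', 'income', 'owns_home']
print(len(cases))                   # the number of rows (cases): 3
```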
Unstructured Data
Everything that does not fit the neat definition above is unstructured data. Video, audio, images, text and websites are just a small number of examples of unstructured data. Since you typically need to work with structured data, starting with unstructured data complicates matters. You will need to extract features (see below) from the unstructured data and in this way obtain structured data that you can use with the machine learning techniques you plan to work with.
Preprocessing
Preprocessing involves making the data ready to use in statistical models. It is a core duty of the data scientist, and many projects require the data scientist to spend more time in this step than in creating and evaluating statistical models. A joke of the profession is that a data scientist spends 90% of their time cleaning and preparing data, and 10% of their time complaining that they spend 90% of their time cleaning and preparing data.
3. Clean Data
This step involves finding and correcting or removing corrupt, inaccurate or unusable values in the data, as well as homogenizing the data. Homogenization means dealing with cases where different subsets of the data record information in different ways, such as using different units for real-valued variables, or different labels for nominal-valued variables. Recognizing inaccurate values can require significant domain knowledge. Data cleaning may be undertaken by a data scientist, or instead be performed before the data is handed to the data scientist for analysis.
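A brief sketch of homogenization, using an invented data set in which different subsets recorded height in different units and used inconsistent labels for sex:

```python
# Sketch of homogenizing a small data set. Column names, units and
# values are invented for illustration.
raw = [
    {"height": 180, "unit": "cm", "sex": "M"},
    {"height": 5.9, "unit": "ft", "sex": "male"},
    {"height": 1.65, "unit": "m", "sex": "Female"},
]

def clean(case):
    # Convert all heights to a single unit (centimetres).
    factors = {"cm": 1.0, "m": 100.0, "ft": 30.48}
    height_cm = case["height"] * factors[case["unit"]]
    # Map the various sex labels onto one labelling scheme.
    sex = "M" if case["sex"].lower().startswith("m") else "F"
    return {"height_cm": round(height_cm, 1), "sex": sex}

cleaned = [clean(c) for c in raw]
print(cleaned)
```

Real cleaning is rarely this mechanical – spotting that a recorded height of 5.9 must be in feet rather than metres is exactly the kind of judgement that requires domain knowledge.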
4. Extract Features
In this step, we decide upon the features that will be used to characterize particular cases in our data. It may seem this step is trivial. If so, that is because you are thinking of structured data, where the data already determines the features involved – they are the columns of the data set. But consider unstructured data, such as video, audio, images, natural language text, etc. In these cases, deciding what features should be extracted from the data, and extracting them, can be a time-consuming and difficult process. Typically, there are standard approaches used for different sorts of data – such as the bag-of-words approach used for text analysis, where cases are documents, features are individual words (from some selected vocabulary) and feature values give the number of times the word is present in the document. Some machine learning techniques automate components of feature extraction. For instance, traditional image analysis required substantial manual effort in feature extraction: choosing the image components to look for (such as lines, dots, etc.) and deciding how to represent these components and their spatial relationships in such a way as to be usable for machine learning algorithms was a specialized and difficult task. These days, convolutional neural networks (a deep learning method not covered in this course) are often used for image analysis. They take as input the matrices of values representing the pixel colors in an image and proceed to extract higher abstraction features from this automatically.
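The bag-of-words approach mentioned above can be sketched in a few lines. The documents and vocabulary here are invented for illustration:

```python
from collections import Counter

# Bag-of-words feature extraction: each document becomes a vector of
# word counts over a chosen vocabulary, turning unstructured text
# into structured data. Documents and vocabulary are invented.
vocabulary = ["data", "model", "learning"]

documents = [
    "data science uses data and a model",
    "deep learning is machine learning",
]

def bag_of_words(doc, vocab):
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]

features = [bag_of_words(d, vocabulary) for d in documents]
print(features)  # one count vector (row of structured data) per document
```

Each document is now a row in a table whose columns are the vocabulary words, so the output can be used with the structured-data techniques discussed above.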
5. Feature Engineering
The raw features involved in structured data or extracted from unstructured data may not be the best set of features to use. There may be too many to use (or too many to use in a suitable model), some may be of low quality, or some may be redundant. We typically want to keep models as simple as possible to optimize their performance, something we will go into more detail about when we look at expected loss and the bias-variance decomposition later this week. Accordingly, it is important to select the smallest number of maximally informative features. Feature selection approaches do this by seeking to find an appropriate subset of the raw features to use with our models. Feature transformation approaches take a more sophisticated line and seek to transform our original features into a small number of highly informative features. We will look more at feature engineering in week 4.
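As one very simple form of feature selection, we might drop any feature whose values barely vary across cases, since a (near-)constant feature carries little information. The data here is invented for illustration:

```python
# Simple feature selection sketch: keep only features whose variance
# across cases is non-negligible. The values are invented.
rows = [
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 1.0],
    [3.0, 5.0, 0.0],
]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

n_features = len(rows[0])
columns = [[row[j] for row in rows] for j in range(n_features)]
keep = [j for j in range(n_features) if variance(columns[j]) > 1e-9]
selected = [[row[j] for j in keep] for row in rows]
print(keep)      # indices of the retained features
print(selected)  # the data set restricted to those features
```

Here the second feature is constant (always 5.0) and is removed. Practical feature selection methods are more sophisticated, typically judging features by how informative they are about the quantity being predicted.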
6. Deal with Missing Data
In real life, data sets are often incomplete, meaning that they are missing some values. But most statistical modeling algorithms cannot work with incomplete data. So when facing such cases, the data scientist must either ‘complete’ the data, providing reasonable estimates of the values that are missing, or remove the cases that contain missing values. We will look more at how to deal with missing values in data, and techniques for estimating the values that are missing, in week 4.
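The two basic options described above – removing incomplete cases, or estimating the missing values – can be sketched on an invented single-variable data set (with None marking missing values):

```python
# Two basic responses to missing values (None here): drop incomplete
# cases, or impute the missing values with the mean of the observed
# ones. The values are invented for illustration.
values = [3.0, None, 7.0, None, 5.0]

# Option 1: remove cases with missing values.
complete = [v for v in values if v is not None]

# Option 2: impute missing values with the mean of the observed values.
mean = sum(complete) / len(complete)
imputed = [v if v is not None else mean for v in values]

print(complete)  # [3.0, 7.0, 5.0]
print(imputed)   # [3.0, 5.0, 7.0, 5.0, 5.0]
```

Mean imputation is only the simplest of the estimation techniques covered in week 4; more sophisticated methods predict each missing value from the other variables of the same case.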
Modelling Process
Once the data is ready for use in statistical models, it is time to turn to creating and evaluating these models. This too is a core duty of the data scientist.
7. Generate Models
Typically a number of different types of models are created, so as to be able to evaluate which approach performs best on the problem at hand with the available data. This model choice is restricted by the type of problem (classification, regression, etc), the amount of data available, as well as desired characteristics of the model (speed of use, interpretability, ease of encoding domain knowledge, etc). It is in this step that the data scientist builds these models.
8. Select Final Model
Once a set of statistical models have been created, they are evaluated. The best performing model is then selected. There are many choices to be made regarding how models should be evaluated, and we look at model selection and evaluation in week 2.
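One common way to select among candidate models is to hold out a validation set and keep whichever model makes the fewest errors on it. The sketch below compares two deliberately simple made-up classifiers (a constant "majority" predictor and a one-dimensional threshold rule) on invented data; week 2 covers evaluation choices properly:

```python
# Model selection sketch: train two simple classifiers and keep the
# one with the lower validation error. Data and models are invented.
train = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
validation = [(0.2, 0), (0.8, 1), (0.7, 1)]

def majority_model(data):
    # Always predicts the most common training label.
    labels = [y for _, y in data]
    mode = max(set(labels), key=labels.count)
    return lambda x: mode

def threshold_model(data):
    # Predicts 1 when x exceeds the midpoint of the two class means.
    mean0 = sum(x for x, y in data if y == 0) / sum(1 for _, y in data if y == 0)
    mean1 = sum(x for x, y in data if y == 1) / sum(1 for _, y in data if y == 1)
    cut = (mean0 + mean1) / 2
    return lambda x: 1 if x > cut else 0

def error_rate(model, data):
    return sum(model(x) != y for x, y in data) / len(data)

models = {"majority": majority_model(train), "threshold": threshold_model(train)}
best = min(models, key=lambda name: error_rate(models[name], validation))
print(best)  # the model with the lowest validation error
```

Note that the validation error of the selected model is an optimistically biased estimate of its performance, which is why the next step exists.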
9. Evaluate Final Model
If an unbiased estimate of the selected model’s performance is required, it is normally necessary to perform an additional step. We discuss the reasons for this in week 2.
Application
Once the final model is selected and evaluated, it is time for the data scientist to turn it over to application developers who will incorporate it in an application.
10. Use Model
If the final statistical model selected is evaluated as sufficiently high-performing, it will be used in whatever task it was designed for. This requires incorporating it into an application that will allow it access to the new data that it will be used on, and allow it to output its results to humans or other programs.