Data Science Workflow
We have found it useful to consider the workflow of a data science problem in terms of the following idealized diagram:
The blue rectangles are steps we will examine in this course.
This diagram is idealized primarily because the process is seldom completely linear in practice. Decisions are made that lead to less than ideal outcomes, and they are then revisited. You may, for example, select features that fail to produce suitably high performing models. You would then seek to examine the effects of choosing different features. Or you may decide that more data is required. It is also idealized in that some steps, such as dealing with missing data, may require repeated iterations of model building and use to refine the data being used.
Nonetheless, the workflow is a helpful guide to real-world application, and we go through the various steps in this article. Depending on your prior experience, you may find that explanations of some steps make more sense once you have taken later sections of this course.
Preliminary steps are typically performed prior to the involvement of data scientists.
1. Decide on the Problem
The problem is specified by the data scientist’s ‘client’. This may, of course, be a group or individual in the same organization as the data scientist (or even the data scientist themselves, if their role involves more than pure data science), but they are the people who wish to make use of the output of the data scientist’s work. This output is a statistical model which will be able to answer questions about the system being modelled. We look at statistical models abstractly in the Statistical Models, Loss Functions and Training as Optimization step, and concretely throughout the last three weeks of the course. We look at a number of different types of learning in the Types of Learning step, which gives an indication of the types of questions such models can answer.
Ideally, the problem specified is one that the client already wants to answer. While there is a place for exploratory data analysis, you should be cautious if this is requested by inexperienced clients. Unfortunately, data science is particularly hyped at the moment, and many organizations feel they should be using their data in some way, and so go looking for something to do with it. The danger is that the data analysis ends up answering questions that turn out to be of little value to the client.
Likewise, ideally the question should be decided upon before considering the available data: the question should determine the data required, rather than the data possessed determining the question. Otherwise, again, the resulting answers may be to questions of little value to the client. If the question was not worth asking before you discovered you had data capable of answering it with data science and machine learning methods, it likely (though not always) remains of little value once this has been discovered. Remember: you can always obtain more data, and the aim should always be to produce models capable of answering high-value questions.
2. Acquire Data
Once the problem is decided upon, it is time to acquire data - even if that means simply working out which data is relevant in pre-existing data storage. At this point, the more data the better, and you should encourage the client to obtain as much data, of as many different types and variables, as possible. Data storage is cheap, and it is better to have data that you never use than to need data you do not have.
Types of Data
Structured data is the form that most people think of when they think of data. A number of cases giving values of a number of variables, it can be thought of as a table. The variables are (typically) the columns and the cases the rows. Different areas are more or less likely to have immediate access to structured data. When training and using statistical models, you will (generally) use structured data.
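As an illustration (with made-up values), a small structured data set can be represented as a collection of cases that all share the same variables:

```python
# A tiny structured data set: each case (row) gives values
# for the same variables (columns). Values are hypothetical.
cases = [
    {"height_cm": 170, "weight_kg": 65, "smoker": "no"},
    {"height_cm": 182, "weight_kg": 80, "smoker": "yes"},
    {"height_cm": 165, "weight_kg": 58, "smoker": "no"},
]

variables = sorted(cases[0])  # the columns
n_cases = len(cases)          # the rows

print(variables)  # ['height_cm', 'smoker', 'weight_kg']
print(n_cases)    # 3
```

In practice you would usually hold such a table in a dedicated structure such as a pandas DataFrame, but the row/column picture is the same.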
Everything that does not fit the neat definition above is unstructured data. Video, audio, images, text and websites are just a small number of examples of unstructured data. Since you typically need to work with structured data, starting with unstructured data complicates matters. You will need to extract features (see below) from the unstructured data and in this way obtain structured data that you can use with the machine learning techniques you plan to work with.
Preprocessing involves making the data ready to use in statistical models. It is a core duty of the data scientist, and many projects require the data scientist to spend more time in this step than in creating and evaluating statistical models. A joke of the profession is that a data scientist spends 90% of their time cleaning and preparing data, and 10% of their time complaining that they spend 90% of their time cleaning and preparing data.
3. Clean Data
This step involves finding and correcting/removing from the data corrupt, inaccurate or unusable values, as well as homogenizing the data. Homogenization means dealing with cases where different subsets of the data record information in different ways, such as using different units for real valued variables, or different labels for nominal valued variables. Recognizing inaccurate values can require significant domain knowledge.
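As a minimal sketch of homogenization, assuming hypothetical survey records where heights were recorded in different units and sex labels in different spellings:

```python
# Homogenize units and labels (hypothetical data and conversions).
raw = [
    {"height": 1.82, "unit": "m",  "sex": "M"},
    {"height": 170,  "unit": "cm", "sex": "male"},
    {"height": 64,   "unit": "in", "sex": "Male"},
]

TO_CM = {"m": 100.0, "cm": 1.0, "in": 2.54}
SEX_LABELS = {"m": "male", "male": "male", "f": "female", "female": "female"}

clean = [
    {"height_cm": round(r["height"] * TO_CM[r["unit"]], 1),
     "sex": SEX_LABELS[r["sex"].lower()]}
    for r in raw
]

print(clean[0])  # {'height_cm': 182.0, 'sex': 'male'}
```

Real cleaning jobs are rarely this tidy, but the pattern - map every variant onto one canonical representation - is the same.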
Data cleaning may be undertaken by a data scientist, or instead be performed before the data is handed to the data scientist for analysis.
4. Extract Features
In this step, we decide upon the features that will be used to characterize particular cases in our data. It may seem this step is trivial. If so, that is because you are thinking of structured data, where the data already determines the features involved - they are the columns of the data set. But consider unstructured data, such as video, audio, images, natural language text, etc. In these cases, deciding what features should be extracted from the data, and extracting them, can be a time-consuming and difficult process. Typically, there are standard approaches used for different sorts of data - such as the bag-of-words approach used for text analysis, where cases are documents, features are individual words (from some selected vocabulary) and feature values give the number of times the word appears in the document.
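A minimal bag-of-words sketch, using a made-up two-document corpus and a three-word vocabulary:

```python
from collections import Counter

# Bag-of-words: each document becomes a vector of word counts
# over a fixed vocabulary.
vocabulary = ["data", "model", "science"]
documents = [
    "data science uses data",
    "the model fits the data",
]

def bag_of_words(doc, vocab):
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]

features = [bag_of_words(d, vocabulary) for d in documents]
print(features)  # [[2, 0, 1], [1, 1, 0]]
```

The unstructured documents have become structured data: two cases, three numeric features.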
Some machine learning techniques automate components of feature extraction. For instance, traditional image analysis required substantial manual effort in feature extraction: Choosing the image components to look for (such as lines, dots, etc.) and deciding how to represent these components and their spatial relationships in such a way as to be usable by machine learning algorithms was a specialized and difficult task. These days, convolutional neural networks (a deep learning method not covered in this course) are often used for image analysis. They take as input the matrices of values representing the pixel colors in an image and proceed to extract higher abstraction features from this automatically.
5. Feature Engineering
The raw features involved in structured data or extracted from unstructured data may not be the best set of features to use. There may be too many to use (or too many to use in a suitable model), some may be of low quality, or some may be redundant.
We typically want to keep models as simple as possible to optimize their performance, something we will go into in more detail when we look at expected loss and the bias-variance decomposition later this week. Accordingly, it is important to select the smallest number of maximally informative features. Feature selection approaches do this by seeking to find an appropriate subset of the raw features to use with our models. Feature transformation approaches take a more sophisticated line and seek to transform our original features into a small number of highly informative features. We will look more at feature engineering in week 4.
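As a toy illustration of feature selection (not a production method), one might drop features that do not vary at all across cases, since a constant feature cannot help distinguish them:

```python
# Drop zero-variance features (made-up data and feature names).
rows = [
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 1.0],
    [3.0, 5.0, 0.0],
]
names = ["age", "constant_flag", "clicked"]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

columns = list(zip(*rows))  # transpose: one tuple per feature
keep = [i for i, col in enumerate(columns) if variance(col) > 0.0]
selected = [names[i] for i in keep]
print(selected)  # ['age', 'clicked']
```

Libraries such as scikit-learn offer more principled selection and transformation methods, but the goal is the same: fewer, more informative features.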
6. Deal with Missing Data
In real life, data sets are often incomplete, meaning that they are missing some values. But most statistical modeling algorithms cannot work with incomplete data. So when facing such cases, the data scientist must either ‘complete’ the data, providing reasonable estimates of the values that are missing, or remove the cases that contain missing values. We will look more at how to deal with missing values in data, and techniques for estimating the values that are missing, in week 4.
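A minimal sketch of the two options, using a made-up column in which missing values are marked as None:

```python
# Handle missing values: drop incomplete cases, or impute
# with the mean of the observed values (hypothetical data).
ages = [25, None, 40, 35, None]

complete = [a for a in ages if a is not None]          # option 1: drop
mean_age = sum(complete) / len(complete)
imputed = [a if a is not None else round(mean_age, 1)  # option 2: impute
           for a in ages]

print(imputed)  # [25, 33.3, 40, 35, 33.3]
```

Mean imputation is only one of many estimation strategies; more careful techniques are covered in week 4.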
Once the data is ready for use in statistical models, it is time to turn to creating and evaluating these models. This too is a core duty of the data scientist.
7. Generate Models
Typically a number of different types of models are created, so as to be able to evaluate which approach performs best on the problem at hand with the available data. This model choice is restricted by the type of problem (classification, regression, etc), the amount of data available, as well as desired characteristics of the model (speed of use, interpretability, ease of encoding domain knowledge, etc). It is in this step that the data scientist builds these models.
8. Select Final Model
Once a set of statistical models has been created, they are evaluated. The best performing model is then selected. There are many choices to be made regarding how models should be evaluated, and we look at model selection and evaluation in week 2.
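A toy sketch of the idea, comparing two deliberately simple candidate "models" (made-up data, made-up fitted slope) by their mean squared error on a held-out validation set:

```python
# Select between two candidate regression models using a
# validation set (all data and models here are hypothetical).
train = [(1, 2.1), (2, 3.9), (3, 6.2)]
valid = [(4, 8.1), (5, 9.8)]

mean_y = sum(y for _, y in train) / len(train)

def model_mean(x):      # candidate A: always predict the training mean
    return mean_y

def model_linear(x):    # candidate B: y = 2x (a pretend fitted slope)
    return 2 * x

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

scores = {"mean": mse(model_mean, valid), "linear": mse(model_linear, valid)}
best = min(scores, key=scores.get)
print(best)  # 'linear'
```

Real model selection typically uses cross-validation rather than a single split, as discussed in week 2.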
9. Evaluate Final Model
If an unbiased estimate of the selected model’s performance is required, it is normally necessary to perform an additional step. We discuss the reasons for this in week 2.
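That additional step usually amounts to holding out a test set that played no part in training or model selection. A sketch of such a three-way split, using placeholder data:

```python
import random

# Three-way split (placeholder "cases" 0..99): the model is
# trained on train, selected using valid, and finally evaluated
# once on test, which was untouched by training and selection.
random.seed(0)
data = list(range(100))
random.shuffle(data)

train, valid, test = data[:60], data[60:80], data[80:]

# No case appears in more than one split.
assert set(train).isdisjoint(valid)
assert set(train).isdisjoint(test)
assert set(valid).isdisjoint(test)
print(len(train), len(valid), len(test))  # 60 20 20
```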
Once the final model is selected and evaluated, it is time for the data scientist to turn it over to application developers who will incorporate it in an application.
10. Use Model
If the final statistical model selected is evaluated as sufficiently high-performing, it will be used in whatever task it was designed for. This requires incorporating it into an application that will allow it access to the new data that it will be used on, and allow it to output its results to humans or other programs.
© Dr Michael Ashcroft