Hi. We’re here with our resident data science expert, and we’re going to look at cleaning data to make sure it’s in a good state for analysis. We have our dataset here to look at, which has to do with educational psychology, and the impact of intervention in cases. So having a look at what you can see here, would you be able to work with this data? Well, at the moment, just eyeballing the data, I can see that there’s one big problem that jumps out at me immediately, which is that there is a missing value in the data.
So we wouldn’t be able to draw any inferences at the moment, and indeed, that would probably crash my software because it would be looking for a value in there. So what can we do about that? So there’s a couple of things we can do. The first would be to remove the entire row of data where there is a missing value, and that might be in one particular area, or across the board. If we remove that value, we reduce the accuracy, and size of the dataset. Would that cause you problems? So in order to draw good inference, we want as much data as possible. So we would really like that row to be in there.
So if there’s something we can do about that missing value, that would be really good. If it’s something that we can infer or we can put in a token value for it, a typical value, that would be really good. Well, that’s exactly what I would look to do then to pass, of course, to you, and to do the analysis on it. Now, we can look at distribution curves, and technical items in order to plug in a value that is representative of the remainder of the dataset. So let’s assume we’ve gone through that technical process, that statistical process.
We will then complete that row for you by filling in a figure that is likely to be correct, and has statistical significance for you. And at that point, we could say, OK, we’ve done that. We’ve done it for 70 of our 1,000 rows. There you go, here’s your dataset again. Can you please work with this, and tell us if attendance rate improves before or after interventions? Well, I can’t work with it quite yet. What I would do with the data is that I would run it through maybe some kind of sanity checking programme, or I might simply visualise the data by graphing it.
But what I can detect here just by looking at this small portion of the data is that some of the values are way outside where I would normally expect them to be. Indeed, I can see that here, we’ve got a value of 150%, and here we’ve got a date, which is, in fact, in the future. So it’s saying that the case opened in 2022. Those must be wrong. So we need to fix those. We need to try and find out what’s going wrong here. Someone’s typed in something wrong. So we treat those probably in a similar way to the missing values. We need to fix these in some way, and put in something that we regard as a safe value.
The other thing I might notice is that some of the data rows, when I visualise it or run it through my sanity checking programme, some of the data rows look to be what we call outliers, and those are data points which are not in the normal range. So this one here, for example, where the attendance rate is 3%. So we might actually want to remove that outlier altogether because it might skew the inferences that we make from it. We want to make inferences based on typical data. It might be that that one data point is not typical, but we don’t want to throw the inference out.
OK. Well, as you say, yes, we’ll be looking to apply some sort of similar methodology to deal with outliers. So if we can imagine a simple box plot, we can identify areas where outliers exists in a certain dataset, or a subset of that dataset. So if we look at the subset of the dataset, we might then want to apply values, as you suggest, in order to give you accurate data. And we’ve dealt with, basically, figures that have probably being keyed in wrong at that stage. We also might want to look across the entire dataset to look for patterns of outliers.
If we were dealing with an area that we believed, for example, the school was recording the data wrongly, what would be your preference in receiving the data? So as a data scientist, I would like to have good data. But if some of the data are suspect, I would at least like to know that. So for example, you could put another column into the data with a note saying these are affected. All these data are affected by the same issue. So treat this particular data field with a pinch of salt, and I might decide to do various things with it when I actually do run my algorithms, and do the analytics.
And there’s two more things then we want to look at in terms of cleaning the data. We’ve got outlier values when we’re looking at a series of numbers, but we might also have coded up data entries. For example, reason for referral. And in this, we’d be looking to make sure that all the reasons for a referral do actually correspond to a reason. So the code, BY, for example, might not actually be a code in any database. So we need to make sure that all of the entries in that column are relevant for a particular type of data.
And the final thing that we need to do as well is before I pass the data across, actually, and we’ve not gone through this process necessarily sharing it openly in this way, but before I pass the data across to the research side and the big data analytics, is to consider the personal information in the dataset. Now, for research purposes, we want there to be a high degree of accuracy in the data, and we will have trust with our researchers through some sort of a data sharing agreement. But the process I would like to go through is then to depersonalise the data, or anonymize the data, and this could be as simple as removing the name.
And then I could pass it across to you, could I? Well, indeed, and the first thing I would want to do with the data is that I would need to get it into a format where I can actually run an automated algorithm on it to make the inferences. Now, at the moment, I can’t do this, and the reason I can’t do this is that the data are in all kinds of formats.
For example, here we have a date. Well, I need that to be a number. The question you’ve asked me is does the intervention affect their attendance rate? And one of the things I want to ask is does the length of the intervention affect the rate? So one of the things I’m going to do is I will normalise these values so that they’re numeric, and I will also maybe add in an extra column talking about the length of the intervention in days. One of the other things that I’ll note about the data, even after it’s been anonymized, is that some of the names, or the anonymized names, occur more than once.
Now, it might be that you’ve got the same person who’s having an intervention more than once. It could be that these are just names– these are different people but with the same name. In either case, what we’re interested in is actually the intervention. That’s the primary object. So I would create, for each of the lines in the data, because it represents an intervention, I would create a unique identifier for that intervention so that I can identify each line in the table individually. Once I’ve done that, I would start by doing some exploratory data analysis.