Bias and error in data collection
In a statistical sense, bias at the collection stage means that the data you have gathered is not representative of the group or activity you want to say something about.
Shortcuts and mistakes of various kinds are part of what makes us human. As the author and psychologist Daniel Levitin (2016) says:
Remember, people gather statistics. People choose what to count, and how to go about counting. There are a host of errors and biases that can enter into the collection process and these can lead millions of people to draw the wrong conclusions.
Bias and error may be unintentional, but sometimes we know they have occurred and ignore them for the sake of an easier life.
Some common biases or error sources to look out for in your own and others’ work include:
Sampling
An unbiased sampling method should mean that any individual data item within the whole set of items, e.g. any name on a list of potential participants in an experiment, has an equal chance of being included. This is why researchers often try to randomise the sampling process, or sample randomly from a set of pre-defined groups. If some aspect of your method limits this, you are risking bias.
For instance, if we decide to base a study of citizens' digital skills on the results of an internet survey, we exclude people without internet access, who may well have lower digital skills. Our data is therefore not representative.
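The difference between the internet-only survey and a random sample can be sketched in a few lines of Python. Everything here is illustrative: the population of 1,000 people, the assumption that roughly 20% lack internet access, and the field names are all made up for the example.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: 1,000 potential participants, roughly
# 20% of whom have no internet access (proportions are illustrative).
population = [
    {"id": i, "has_internet": random.random() < 0.8}
    for i in range(1000)
]

# Biased approach: an internet survey can, by construction, only reach
# the people who already have access.
internet_survey = [p for p in population if p["has_internet"]]

# Less biased approach: sample at random from the full frame, so every
# individual has an equal chance of being included.
random_sample = random.sample(population, k=100)

offline_share = sum(not p["has_internet"] for p in random_sample) / len(random_sample)
print("Offline participants reached by the internet survey: 0")
print(f"Offline share of the random sample: {offline_share:.0%}")
```

The random sample includes offline people in roughly their true proportion, while the internet survey excludes them entirely, however large it is.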
Measuring and calculating
Data collected from instruments or digital systems may be prone to error if, for example, a sensor is broken or a log file is corrupted. Often people will incorporate a sense-check that picks up anomalies like these.
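A sense-check of this kind can be as simple as flagging values outside a plausible range. The sketch below assumes temperature readings in degrees Celsius; the plausible range, the sentinel value -999.0 and the sample data are all invented for illustration.

```python
# Hypothetical sensor log; -999.0 is a made-up "broken sensor" code.
readings = [18.2, 18.4, -999.0, 18.1, 57.3, 18.3]

# Assumed plausible range for outdoor temperature in degrees Celsius.
PLAUSIBLE_MIN, PLAUSIBLE_MAX = -40.0, 50.0

def sense_check(values, low, high):
    """Split readings into plausible values and flagged anomalies."""
    ok = [v for v in values if low <= v <= high]
    anomalies = [v for v in values if not (low <= v <= high)]
    return ok, anomalies

ok, anomalies = sense_check(readings, PLAUSIBLE_MIN, PLAUSIBLE_MAX)
print(f"Kept {len(ok)} readings, flagged {anomalies}")
```

Flagged values should be investigated rather than silently dropped: a run of anomalies may indicate a broken sensor or a corrupted log rather than a few stray readings.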
Similarly, it is easy for the researcher themself to make a simple error in calculation, especially when complex code is used to determine results. This is where open source and open data can help in letting other people reproduce the calculations to check them.
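The value of reproducibility can be shown with a toy example: when both the data and the analysis script are open, anyone can re-run the calculation and compare it with the published figure. All numbers below are invented.

```python
# Made-up open dataset and a (made-up) published summary statistic.
open_data = [3.1, 2.9, 3.0, 3.2]
published_mean = 3.05

# Anyone with the data and this script can recompute the figure
# and check it against what was published.
recomputed = sum(open_data) / len(open_data)
assert abs(recomputed - published_mean) < 1e-9, "calculation does not reproduce"
print(f"Published mean reproduced: {recomputed}")
```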
Self-selection
In an opinion survey, you may be more likely to get participation from people who feel strongly about the subject one way or another, while those who are less opinionated may be less likely to respond. One way to reduce this is to advertise the survey under a more general theme: for example, you might say your survey is about shopping in general rather than views on Marmite.
Self-reporting
People may tell you something about their own behaviour that differs from how they behave in practice. This is why it is sometimes good to combine self-reports with behavioural data.
It’s important for a data scientist to sharpen their understanding of statistical evidence and the claims or decisions that it may or may not support.
To sharpen your own understanding, we recommend you practise spotting biases and errors in your daily life – as you read newspapers, social media and research studies.
Going forward, you will gain confidence in designing your own analyses and applying appropriate statistical tests. You’ll also become adept at explaining what the results imply and understanding their limitations or the potential sources of bias that might have been introduced during data collection.
Have a look at the report Bouncing Back: Consumer Views on Traveling Again, which presents a survey of people's intentions to fly again after COVID-19.
If the report is about all potential consumers, what bias might there be from their sampling strategy?
Think about some of the possible issues with bias and error and how the survey could have been presented to the respondents. Share your ideas with your fellow learners in the comments area.
Levitin, D. (2016). A field guide to lies and statistics. Viking.
Flywire. (2020). Bouncing back: Consumer views on travelling again. https://flywire.foleon.com/report/bouncing-back-consumer-views-on-traveling-again/cover/?
© Coventry University. CC BY-NC 4.0