Before we go on to analyse our data, we need to consider whether the data we have gathered is correct. This is known as data validation.
Going back to the example of milk, if your records show you drank 11 pints of milk today, did you record it correctly or accidentally ‘double tap’ the ‘1’ key? We need to look out for the following anomalies:
- Systematic errors: for instance, the flow meter pumping the milk to the tanker needs re-calibrating.
- Error by omission: the shop forgot to sign in the delivery, so it has more milk in the store than its records show.
- Incomplete data: when checking in a delivery, we recorded the total volume of milk but not the package sizes.
- Duplicate data: we checked the same delivery in twice.
- Sampling error or bias: we measure the milk output from the first ten cows into the milking parlour. Is it always the same ten cows? Is there any correlation between the time of arrival and milk output?
- Rounding errors: consumers complain if the milk is under-filled, so our bottling machine always errs on the high side. Effectively, the volume we record as sold is rounded down every time.
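Two of the anomalies above, duplicate data and incomplete data, are simple enough to check for automatically. The following is a minimal sketch in Python, not a definitive method. The record layout and the field names (`delivery_id`, `total_litres`, `package_litres`) are assumptions made up for this illustration.

```python
# Hypothetical delivery records; the field names and values are assumptions.
deliveries = [
    {"delivery_id": "D001", "total_litres": 200.0, "package_litres": 2.0},
    {"delivery_id": "D002", "total_litres": 150.0, "package_litres": None},  # package size missing
    {"delivery_id": "D001", "total_litres": 200.0, "package_litres": 2.0},   # checked in twice
]

REQUIRED_FIELDS = ("delivery_id", "total_litres", "package_litres")


def find_duplicates(records):
    """Return each delivery ID that appears more than once."""
    seen = set()
    duplicates = []
    for record in records:
        key = record["delivery_id"]
        if key in seen and key not in duplicates:
            duplicates.append(key)
        seen.add(key)
    return duplicates


def find_incomplete(records, required=REQUIRED_FIELDS):
    """Return the IDs of records where a required field is missing or None."""
    return [
        record.get("delivery_id")
        for record in records
        if any(record.get(field) is None for field in required)
    ]


duplicate_ids = find_duplicates(deliveries)
incomplete_ids = find_incomplete(deliveries)
print("Duplicates:", duplicate_ids)
print("Incomplete:", incomplete_ids)
```

Checks like these only flag records for a person to review. Systematic errors and sampling bias cannot be found this way, because every individual record can look perfectly valid.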
If you sign up for the Coventry University program ‘AI Technologies for Business and Management’, the next short course, ‘Fundamental Machine Learning for AI’, will illustrate some methods for validating and cleaning your data, as well as the dangers of leaving it to the machine.
Think of three types of data anomaly that could be presented in the ‘grass to glass’ scenario.
For each anomaly, think of one way of correcting it, then counter that example with a situation where it would be incorrect. For example, 11 pints of milk – if it’s a household consumer, we might suggest a maximum value of three pints per day, but what if it’s a small business or cafe?
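The 11 pints example can be sketched as a range check whose limit depends on who the customer is. This is only an illustration: the three pint household limit comes from the example above, while the cafe limit and the function name `flag_consumption` are assumptions.

```python
# Assumed daily limits in pints. Only the household value comes from the example;
# the cafe value is a made-up figure for illustration.
MAX_PINTS_PER_DAY = {"household": 3, "cafe": 40}


def flag_consumption(pints, customer_type):
    """Classify a daily consumption record as 'ok', 'suspect' or 'unknown customer type'."""
    limit = MAX_PINTS_PER_DAY.get(customer_type)
    if limit is None:
        # No rule exists for this customer type, so do not guess.
        return "unknown customer type"
    if pints < 0 or pints > limit:
        return "suspect"
    return "ok"


print(flag_consumption(11, "household"))  # above the household limit
print(flag_consumption(11, "cafe"))       # within the assumed cafe limit
```

The same value of 11 is flagged for a household but accepted for a cafe, which is why a single fixed rule can be correct in one context and incorrect in another.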
Think about more than just simple numbers. Are there any regional variations that might need adjusting if your Artificial Intelligence (AI) system is rolled out in another country (for example, date formats in the USA or the spelling of words in different countries)?
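The date problem is easy to demonstrate. In the short sketch below, the same text is read first with the day first (as in the UK) and then with the month first (as in the USA), using Python's standard `datetime` module. The sample date is an arbitrary value chosen for illustration.

```python
from datetime import datetime

raw = "03/04/2024"  # arbitrary sample value

# Day first, as commonly written in the UK.
as_uk = datetime.strptime(raw, "%d/%m/%Y").date()

# Month first, as commonly written in the USA.
as_us = datetime.strptime(raw, "%m/%d/%Y").date()

print("UK reading:", as_uk.isoformat())
print("US reading:", as_us.isoformat())
print("Difference in days:", abs((as_uk - as_us).days))
```

Both readings are valid dates, so no error is raised. The system has to be told which convention the data uses, or it should store dates in an unambiguous form such as ISO 8601 (year, month, day).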
© Coventry University. CC BY-NC 4.0