Precision and recall

How good is your model? Watch the video on this step, which explains the principles behind evaluating your model.
1940. World War Two. The Battle of Britain. The German Luftwaffe are flying bombing raids over London and southern England. In an attempt to detect and intercept enemy aircraft as they approached from France, the Royal Air Force deployed a ring of coastal early warning radar stations, called Chain Home. This early warning radar network gave RAF commanders a window of 15 minutes to scramble fighters directly into the path of a bombing raid, and it proved decisive in defending against large-scale attacks and preventing the Luftwaffe gaining air superiority over the UK.
Although effective, this technology was in its infancy, and operating a Chain Home radar station was a manpower-intensive, time-critical operation in which reports had to be conveyed to a hierarchy of military decision makers. Once a signal was received, the countdown clock was triggered, and it became crucial that operators could ensure a degree of accuracy and confidence in their reporting. This is why, in statistics, accuracy, precision and recall become important metrics in classification tasks. Accuracy is how close to the real value a reported measurement is. In our example, the real value is an enemy bomber. The higher the accuracy of the radar detection system, the more lives we save. Precision measures the ability of a classification model to return only relevant instances.
Therefore, of the radar signals received, how many were actually enemy bombers? Recall measures the ability of a classification model to identify all relevant instances. Therefore, of all the possible inbound airborne objects that occurred, how many did the radar system detect? Early radar was only a moderately good test for discriminating between a reference standard positive (a squadron of enemy bombers) and a reference standard negative (which might be a flock of seagulls). How could the radar station operators be confident that they were reporting signals correctly?
An operator could increase the sensitivity of their equipment in order to identify more signals, increasing recall, and then report every signal as a potential threat of incoming bombers. But this led to RAF fighters being unnecessarily scrambled to defend against a flock of seagulls, which lowers the precision of the detection system. To avoid this, operators could tune their instruments to be more specific; whilst this gave a higher precision and reduced false alarms, it would also fail to detect some bombing runs in time, lowering the recall rate and leading to potentially devastating consequences. Radar operators simply could not have both high recall and high precision when interpreting signals.
They therefore needed a trade-off between the two in order to make a positive difference. This became known as the receiver operating characteristic, or ROC, and it is still used today as a means of evaluating classifiers. Every decision we make is a trade-off. We cannot be absolutely certain of making the right choice, so we have to minimize the probability of making the wrong choice. But choices have two ways to be wrong: one, you decide to do something when you shouldn't have; and two, you decide not to do something when you should have.
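This trade-off can be sketched in plain Python. Here every score and label is invented for illustration: each "signal strength" stands in for a radar reading, and sweeping a detection threshold trades the true positive rate (recall) against the false positive rate, the two axes of a ROC curve.

```python
# Hypothetical radar "signal strength" scores and true labels
# (1 = enemy bomber, 0 = not a bomber). All values are made up.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,    1,   0,   0,   1,   0]

def roc_point(threshold):
    """True positive rate and false positive rate at one threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    tpr = tp / (tp + fn)  # recall: bombers correctly detected
    fpr = fp / (fp + tn)  # false alarms: seagulls reported as bombers
    return tpr, fpr

for t in (0.25, 0.5, 0.75):
    tpr, fpr = roc_point(t)
    print(f"threshold={t:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Lowering the threshold (a more sensitive operator) raises the true positive rate but also the false alarm rate; raising it does the opposite, exactly the dilemma the radar operators faced.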
When working with classifiers, there are four possible outcomes: true positive, true negative, false positive and false negative. So, in our example: true positive, the radar says there are enemy bombers approaching and they really are. True negative: the radar says signals received are not enemy bombers, and they really are not. False positive: the radar says enemy bombers are incoming but they're not, or at least aren't enemy bombers, which the scrambled fighters will verify. False negative: the radar says that signals received are not enemy bombers, but they really are, which is the worst scenario for an early warning detection system. With classifiers, we want as many true and as few false results as possible.
Evaluating the accuracy of a classification model depends on what we want to achieve. We need to ask whether it's more important for an algorithm to return most of the relevant results (a high recall rate), or whether we seek substantially more relevant results than irrelevant ones (a higher rate of precision).
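One standard way to encode that priority (not covered in the video, but common practice) is the F-beta score, a weighted harmonic mean of precision and recall: beta greater than 1 favours recall, beta less than 1 favours precision. A small sketch with hypothetical precision and recall values:

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.
    beta > 1 favours recall (missing a bomber is costly);
    beta < 1 favours precision (false alarms are costly)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.6, 0.9  # hypothetical classifier scores
print(f"F1   = {f_beta(p, r, beta=1.0):.3f}")  # balanced
print(f"F2   = {f_beta(p, r, beta=2.0):.3f}")  # recall-weighted
print(f"F0.5 = {f_beta(p, r, beta=0.5):.3f}")  # precision-weighted
```

For an early warning radar, where a false negative is the worst outcome, a recall-weighted score (beta > 1) would be the natural choice.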

At this point in the course, you have successfully built your first model to classify portrait and landscape paintings based on their physical dimensions.

Now we must ask ourselves: how good will it be at predicting the right answer? In other words, how accurately does it predict the right label, does it correctly identify only the relevant items of data, and has it classified all instances of the data we wish to find?

Your task

Watch the video and learn the principles behind how we will evaluate our model.
In the comments, explore how the example of receiving signals in a radar system relates to our example of classifying portrait and landscape paintings.
This article is from the free online

Applied Data Science

Created by
FutureLearn - Learning For Life
