Nearest Neighbour & Breast Cancer Study

Learn how to use popular machine learning algorithms such as nearest neighbor to analyze breast cancer data.

Worldwide, breast cancer is the most common invasive cancer in women, accounting for almost 25% of invasive cancers in women. Although it is still a threat, breast cancer receives a proportionately higher share of resources and attention than other diseases or other cancers. You probably know the pink ribbon, the most prominent symbol of breast cancer awareness. Today, we will use one of the most popular machine learning algorithms, nearest neighbor, to analyze breast cancer data.

To do this, we will use the famous Wisconsin Breast Cancer data, which comes from Dr. Wolberg’s report on his clinical cases. The goal of the study is to classify breast tissue cells as benign or malignant. In other words, our goal is to tell whether a cell is cancerous or not by looking at other information about it.

In his dataset, ten attributes describe each patient. There is also one target variable, which records whether each patient has a benign or malignant cell. Using this data, we will build a model that can tell, from a new patient’s information, whether the cell is closer to benign or malignant. I hope you remember that this is supervised learning: we learn from examples whose answers are already known.

The features are computed from a digitized image of a breast mass and describe the characteristics of the cell nuclei present in the image. The ten attributes are radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.

First of all, let me show what we are looking for in this dataset. We already know which patients in the dataset have breast cancer. With the nearest neighbor, we want to figure out which of these known patients are closest to a new patient.

But how do we measure how close two patients are? Here we will use something called Euclidean distance. Let’s assume we have two points, A and B. A has the two-dimensional coordinates (x1, y1) and B has (x2, y2). The distance between the two is the square root of (x1 - x2) squared plus (y1 - y2) squared. There are many different distance measures, but we will use Euclidean distance.
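The formula above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function name euclidean_distance and the two sample points are assumptions chosen for the example. It is written so that it also works for points with more than two coordinates.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of coordinates")
    # Square each coordinate difference, add them up, then take the square root.
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Two hypothetical points: A = (x1, y1) and B = (x2, y2).
point_a = (1.0, 2.0)
point_b = (4.0, 6.0)
print(euclidean_distance(point_a, point_b))  # sqrt(3^2 + 4^2) = 5.0
```

Because the function loops over however many coordinates it is given, the same code covers the ten-attribute case without changes.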

To convey the idea of the nearest neighbor, I will use only part of our dataset. Although I use only five rows of data, the same applies to all of the rows. Let’s say a new patient comes in, and we have the ten attributes that describe this patient’s breast tissue cell.

We will use Euclidean distance to measure which current patient is closest to this new patient. Let me explain a little more using one sample. The distance from the new patient to patient #1 is the square root of (1st patient radius - new patient radius) squared plus (1st patient texture - new patient texture) squared, and so on for the rest of the attributes. After doing the same calculation for every patient, we have the distance from the new patient to every instance.
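As a rough sketch of this step, the code below computes the distance from a new patient to each of five existing patients. All numbers are made up for illustration and are not taken from the Wisconsin data, and only three of the ten attributes are used to keep the example short.

```python
import math

# Hypothetical values for illustration only, NOT taken from the Wisconsin data.
# Each row: (radius, texture, smoothness); the real data has ten attributes.
patients = {
    1: (14.0, 20.0, 0.10),
    2: (20.0, 25.0, 0.12),
    3: (12.0, 15.0, 0.09),
    4: (18.0, 22.0, 0.11),
    5: (12.5, 17.0, 0.08),
}
new_patient = (13.5, 19.0, 0.10)

def distance(a, b):
    # Euclidean distance over all attributes of the two rows.
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# One distance per existing patient, keyed by patient number.
distances = {pid: distance(row, new_patient) for pid, row in patients.items()}

# Print the patients from closest to farthest.
for pid, d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"patient #{pid}: distance {d:.3f}")
```

With these made-up values, patient #1 comes out closest to the new patient.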

Who is the closest? Who are the closest three? They are the nearest neighbors of this new patient.

How will we conclude whether this new patient has a benign or malignant cell? If we choose only one nearest neighbor, this patient gets the same conclusion as the closest one. But if we choose three nearest neighbors, this patient gets the conclusion shared by the majority of the three, which is the same as rounding the average of the three when benign and malignant are coded as 0 and 1.
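Picking the nearest neighbors and turning them into a conclusion might look like the sketch below. The training rows, the classify function, and the use of a majority vote are assumptions made for this illustration, and the values are invented rather than taken from the dataset.

```python
import math
from collections import Counter

# Hypothetical labelled examples: ((radius, texture), diagnosis). Values are made up.
training = [
    ((14.0, 20.0), "benign"),
    ((20.0, 25.0), "malignant"),
    ((12.0, 15.0), "benign"),
    ((18.0, 22.0), "malignant"),
    ((12.5, 17.0), "benign"),
]

def distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def classify(new_point, examples, k):
    # Sort the known patients by distance to the new one and keep the k closest.
    nearest = sorted(examples, key=lambda ex: distance(ex[0], new_point))[:k]
    # Count the diagnoses among those neighbors and return the most common one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

new_patient = (17.0, 21.0)
print(classify(new_patient, training, k=1))  # malignant
print(classify(new_patient, training, k=3))  # malignant
```

With two classes, an odd number of neighbors such as one or three means the vote can never end in a tie.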

I have to tell you that I left out many of the data analysis steps that a real nearest neighbor model needs. These include normalizing the data and validating the results of the nearest neighbor. I did that only because I wanted to convey the core idea of the method. You may not yet be able to apply the nearest neighbor to big data, which takes serious programming, but I hope you now have an idea of what the nearest neighbor does.
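To give a sense of what normalization involves, here is one possible sketch. Min-max scaling is an assumption on my part, since the step above does not say which method is used, and the function name and sample values are hypothetical. The reason it matters is that an attribute measured in hundreds, such as area, would otherwise dominate the distance over an attribute measured in fractions, such as smoothness.

```python
def min_max_scale(rows):
    """Rescale every column to the range 0..1 so no single attribute dominates the distance."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append(tuple(
            # A column whose values are all equal carries no information; map it to 0.0.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ))
    return scaled

# Hypothetical (area, smoothness) values, made up for illustration.
rows = [(500.0, 0.08), (1000.0, 0.10), (750.0, 0.12)]
for original, rescaled in zip(rows, min_max_scale(rows)):
    print(original, "->", rescaled)
```

After scaling, both columns run from 0 to 1, so each attribute contributes on a comparable scale when distances are computed.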

This article is from the free online course Artificial Intelligence and Machine Learning for Business.

Created by
FutureLearn - Learning For Life
