Distance Metrics

Find out about different types of distances and learn how different metrics influence the partition of the data in our practical task.
© Coventry University. CC BY-NC 4.0

In order to build more complex models, we need to discuss metrics. While metrics can be very abstract, fundamentally they simply provide a notion of distance between (data) points and much of our everyday intuition about distances carries over.

Mathematically, a metric is a function (f : D times D rightarrow R) which takes two data points (x) and (y) and assigns them a real number (f(x, y)) which we interpret as their distance—if the number is small, they are close together, if it is large they are far apart.

Why are distances useful for data models? Consider the earlier example of businesses in London again. The markers on the map represent certain businesses in this area. Can you see a pattern?

You probably immediately noticed the dense cluster just north of Blackfriars bridge and the less dense but still noticeable cluster north of Saint James’s Park. If we have the additional information that there are indeed only two types of businesses in the dataset, then we should investigate whether those two geographical clusters have anything to do with the business type. For example, we could try to separate the two clusters by a straight line and surmise that all businesses on the left are of type A and all on the right are of type B:

Of course, reality is more complicated, but our very ad-hoc ‘clustering by proximity’ works surprisingly well. This is mostly owed to the type of business we chose here. The following map finally contains the ‘ground truth’, the correct labelling of businesses as jewellers (red) and art galleries (blue) Our simple model correctly identifies 52% of the former and 80% of the latter:

Art galleries are concentrated in the affluent Mayfair district, but we do see plenty of jewellers there as well, something that our simple model fails to capture. The dense red clusters are business located in Hatton Garden, the historic jeweller’s quarter. It is worthwhile to reflect the underlying core assumptions we made when we built our simple model: Geographic proximity reflects domain similarity. In other words, data points that are close together likely have the same label. We want to generalize and abstract this idea to higher-dimensional data, which leads us back to metrics: provided we choose a metric that captures ‘similarity’ of data, we can built try find and separate different clusters or transfer labels from one datapoint to nearby other ones.

At this point you might ask: why metrics? Is distance not always just distance? It turns out that even in the real world, we need more than the ‘as the crow flies’ distance! Below is a map of midtown Manhattan. Let’s say we would like to travel from the intersection 10th Avenue/44 Street to the entrance of Central Park. The Euclidean distance (‘as the crow flies’, pink line) between these two points is about one mile. However, it is impossible to travel along this trajectory, instead we must follow the grid-layout of the Manhattan road network (blue line). The distance measured according to this Manhattan distance is 1.43 miles is a much more realistic estimate of how far we will have to travel in this specific situation.

We will focus on distances between vectors here and there are several established metrics that we can chose from. For two d-dimensional vectors (x ⃗=(x_{1}, x_{2},…,x_{d} )) and (y ⃗= (y_{1},y_{2},…,y_{d})) , these metrics are defined as follows:

Euclidean distance: To compute the Euclidean distance, we take the elementwise differences of the vectors, square these differences, sum them up and finally take the square root of the sum. Written as a formula, this is written as, (d_{EUC}(x ⃗,y ⃗ )=√sum_{i=1}^d (x_{i}-y_{i})^2). The Euclidean distance, (also called the L2 norm) is what we understand as the ‘natural’ distance between objects in the real world.

Manhattan distance: The Manhattan distance (also called Taxicab distance or L1 norm), as we discussed above, measures distances ‘along a grid’. In essence, we want to know by how much we need to change every coordinate of (x ⃗) in order to turn it into (y ⃗). To compute it, we take the absolute difference between each component of the vectors and sum those values up. Expressed as a formula: (d_{MAN}(x ⃗,y ⃗ )= sum_{i=1}^d <x_{i}-y_{i}> .)

Hamming distance: The Hamming distance also measures how vectors differ component-wise, but in contrast to the Manhattan distance it only matters whether the components differ, not by how much. To compute it, we count all components in which (x ⃗) and (y ⃗) differ. We can express this as the formula (d_{HAM}(x ⃗,y ⃗ )= sum_{i=1}^d 1_{x_{i}≠y_{i}}) where the indicator function (1_{x_{i}≠y_{i}}) evaluates to 1 if the condition in the subscript is true and 0 otherwise.

Chebyshev distance: The Chebyshev distance measures the maximum difference across all coordinates. It is also known as the ‘chessboard distance’ because it corresponds to the number of moves the King in chess needs to make in order to reach a certain position. We compute it using the formula d_{CHB} (x ⃗,y ⃗ ) = <(max)┬i⁡{x_{i}-y_{i}}.>

Cosine distance: As our final example, the cosine distance measures the angle between the vectors (x ⃗) and (y ⃗). Imagine two lines drawn from the origin to the points ((x_{1}, x_{2}, …, x{d})) and ((y_{1}, y_{2}, …, y{d})) in d-dimensional space, then the cosine distance is one minus the cosine of the angle between those two lines. Mathematically, this is expressed as the formula

<d_“COS” (x ⃗,y ⃗ )=1-(∑(i=1)^d▒x_i y_i)/√(∑(i=1)^d▒x_i^2 ⋅∑_(i=1)^d▒y_i^2 )>

or, using vector notation for the dot-product and the norm of a vector, <d_“COS” (x ⃗,y ⃗ )=1-(x ⃗⋅y ⃗)/(|x ⃗|⋅|y ⃗|).>

The choice of metric is a deep topic that we cannot cover in detail here. Some strengths and weaknesses for certain metrics are known, like that cosine distance behaving better in higher dimensions than the other metrics. However, at the end of the day the choice of metric is heavily influenced by the data we work with and the methodology we choose to apply.

This task involves visualising metrics in Python (10 mins)
Open and review this Jupyter notebook. Then run the notebook and generate the plots presented in this weeks activity.
You can adapt the plots in this notebook to change the look and feel of the plots.
Visit the [Folium library quick start – need the correct link] to get an overview of the styling options available. We will be revisiting location data and the Python Folium library later
© Coventry University. CC BY-NC 4.0