
Deep face recognition

[Image: Man wearing mask. © University of York]

The problem of face recognition usually takes one of two possible forms:

  1. “Am I who I say I am?” – This is known as face authentication. A user presents an image of a known individual (such as the photograph in a passport) and the system must determine if this matches the user’s face.

  2. “Who am I?” – This is known as face recognition. The identity of a given face must be found in a database of known individuals.

It’s not obvious how to solve either of these problems using what we’ve already covered. We might consider the second problem as one of classification. However, if we were to have one output class for every person in the world, we would need billions of classes. Alternatively, if we trained a network only to classify a fixed set of people, we would need to completely retrain the network every time we wanted to add new people.

In addition, distinguishing between faces is difficult. Faces are all very much alike, and only subtle differences distinguish one person from another. Changes in background, illumination, facial expression, hair style, hats, glasses and many other factors change the image a great deal but do not change who the person is.

Embedding networks

There is a good solution to both face recognition and authentication which involves a slightly different class of network. It also helps us to ignore image variability that is not related to identity. Instead of classification or regression, we will train a network to embed a face image into a low dimensional space. The idea is that this embedding will depend only on the identity of the person. By measuring the difference between embeddings we can decide whether two images are of the same person.

An embedding is simply a point in a low dimensional space, represented by a vector. When we say “low dimensional space” we simply mean that the vector has many fewer dimensions than the original image. For example, if we are working with images of size \(256 \times 256\) and 3 colour channels then our original data lies in a \(256\times 256\times 3 = 196,608\) dimensional space. Our embedding might have only 512 dimensions.

Mathematically, if our input image is \(x\), we compute \(y = f(x)\) where \(f\) is a CNN and \(y\) is the embedding (a vector). The idea is that \(y\) only depends on the identity of the face in \(x\). Then, to compare two images \(x_1\) and \(x_2\), we compute their embeddings \(y_1\) and \(y_2\) using the trained network. We then compare the two embeddings using some distance measure \(d(y_1,y_2)\). For example, we might simply use the Euclidean distance between them: \(d(y_1,y_2) = \|y_1 - y_2\|\) (in practice other distance measures work better). If this value is smaller than some chosen threshold, we conclude that the images must be of the same person. If it’s larger, we conclude they do not match.
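As a toy sketch of this comparison step, suppose the embeddings have already been computed by the trained network. The 4-dimensional vectors and the threshold value below are purely illustrative (real embeddings might have 512 dimensions, and the threshold would be tuned on validation data):

```python
import numpy as np

def same_person(y1, y2, threshold=1.0):
    """Decide whether two embeddings belong to the same identity.

    y1, y2   : embedding vectors produced by the trained network f
    threshold: distance below which we declare a match (an arbitrary
               placeholder here; in practice tuned on validation data)
    """
    d = np.linalg.norm(y1 - y2)  # Euclidean distance d(y1, y2)
    return d < threshold

# Toy 4-dimensional embeddings
y_anchor = np.array([0.1, 0.9, -0.2, 0.4])
y_close  = np.array([0.1, 0.8, -0.1, 0.5])   # similar -> same person
y_far    = np.array([-0.9, 0.2, 0.8, -0.6])  # dissimilar -> different person
```

Here `same_person(y_anchor, y_close)` returns `True` and `same_person(y_anchor, y_far)` returns `False`.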

Triplet loss

The question is: how do we train \(f\) so that \(y\) depends only on the identity of the face in \(x\)? Here we use a clever trick called Siamese training. We use three identical copies of the current version of the network. We pass three images through the three networks: \(x_1\), \(x_2\) and \(x_3\). We choose the images so that \(x_1\) and \(x_2\) are the same person while \(x_3\) is a different person. Now we compute a loss using the three embeddings that encourages the embeddings of the same person to be close together and embeddings of different people to be far apart:

\[L = \max(d(y_1,y_2) - d(y_1,y_3)+\alpha,0)\]

where \(\alpha\) is the margin by which we wish negative pairs to be further apart than positive pairs.
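The loss can be written down directly. This NumPy sketch uses Euclidean distance for \(d\) and an illustrative margin value:

```python
import numpy as np

def triplet_loss(y1, y2, y3, alpha=0.2):
    """L = max(d(y1, y2) - d(y1, y3) + alpha, 0).

    y1: anchor embedding, y2: positive (same person as anchor),
    y3: negative (different person), alpha: margin.
    """
    d_pos = np.linalg.norm(y1 - y2)  # distance to same-identity image
    d_neg = np.linalg.norm(y1 - y3)  # distance to different-identity image
    return max(d_pos - d_neg + alpha, 0.0)
```

Note that if the negative is already further from the anchor than the positive by more than the margin, the loss is zero and that triplet produces no update.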

Now we backpropagate this loss into the three copies of the network and combine each of the gradient descent updates such that all three copies are updated identically.
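In code, the three identical “copies” amount to one network with shared weights: the same parameters are used for all three forward passes, so a single backward pass accumulates the gradients from all three branches and the update is automatically identical. A minimal PyTorch sketch, with a toy linear layer standing in for the CNN and made-up sizes:

```python
import torch

# One network with shared weights plays the role of all three "copies".
net = torch.nn.Linear(8, 4)  # toy embedding network (sizes are illustrative)

x1, x2, x3 = (torch.randn(1, 8) for _ in range(3))  # anchor, positive, negative
y1, y2, y3 = net(x1), net(x2), net(x3)  # same parameters for all three passes

alpha = 0.2  # margin
loss = torch.clamp((y1 - y2).norm() - (y1 - y3).norm() + alpha, min=0.0)
loss.backward()  # one backward pass accumulates gradients from all three branches
```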

[Figure: Triplet loss]

In practice, the network is usually initialised by training a classification network with a fixed number of face identities. Then we throw away the classification layer and continue training using the layer prior to the classification layer as our embedding output.
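A sketch of this two-stage idea in PyTorch; the tiny backbone, the layer sizes and the 1000-identity classifier are all placeholders for a real deep CNN and training set:

```python
import torch
import torch.nn as nn

# Stand-in backbone; in practice this would be a deep CNN.
# All sizes here are illustrative.
backbone = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 512),  # its output is the 512-D embedding
    nn.ReLU(),
)
classifier = nn.Linear(512, 1000)  # 1000 training identities (assumed)

# Stage 1: train backbone + classifier with cross-entropy on fixed identities.
model = nn.Sequential(backbone, classifier)

# Stage 2: discard the classification layer; the layer before it
# becomes the embedding output, and triplet training continues from here.
embedder = backbone
x = torch.randn(4, 3, 32, 32)  # a batch of 4 toy images
y = embedder(x)                # embeddings of shape (4, 512)
```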

Using an embedding network

As already mentioned, face authentication is solved by choosing a distance threshold between two embeddings. This threshold trades off incorrectly rejecting a genuine match (false negatives) against incorrectly accepting a non-match (false positives). For face recognition, we compute the embedding vector for every face in our database. Then, given an input image, we compute its embedding, find the closest embedding in the database and report the corresponding identity.
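The recognition step is then a nearest-neighbour search over the precomputed database embeddings. A small sketch, with made-up 2-D embeddings and identity names:

```python
import numpy as np

def recognise(query_embedding, database):
    """Return the identity whose stored embedding is closest to the query.

    database: dict mapping identity name -> embedding vector,
              precomputed with the trained network.
    """
    return min(database,
               key=lambda name: np.linalg.norm(database[name] - query_embedding))

# Toy database of precomputed embeddings (2-D for illustration)
gallery = {
    "alice": np.array([0.9, 0.1]),
    "bob":   np.array([-0.8, 0.5]),
}
```

For example, `recognise(np.array([0.85, 0.15]), gallery)` returns `"alice"`. In a real system one would also apply the authentication threshold to the closest match, so that unknown faces can be rejected rather than assigned to the nearest identity.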


Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. “Deep face recognition.” Proceedings of the British Machine Vision Conference (2015).

© University of York
This article is from the free online course Intelligent Systems: An Introduction to Deep Learning and Autonomous Systems, delivered on FutureLearn.
