How does face recognition work?

This article looks at some additional considerations required to make face recognition work as effectively as possible.
Abstract face image in rock
© University of York

Using a CNN to compute an embedding of a face image is the standard method in state-of-the-art face recognition.

However, there are some additional considerations required to make it work well in practice.

Let’s take a look at them.
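To make the idea concrete, here is a minimal sketch of how two face embeddings might be compared at recognition time. The 4-D toy vectors and the 0.4 threshold are illustrative assumptions, not part of any real system: actual embeddings are high-dimensional, and the threshold is tuned on validation data.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two embedding vectors (0 = identical direction)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(np.dot(a, b))

def same_identity(emb_a, emb_b, threshold=0.4):
    """Declare a match if the embeddings are closer than the threshold.
    The threshold is illustrative; in practice it is tuned on a validation set."""
    return cosine_distance(emb_a, emb_b) < threshold

# Toy 4-D embeddings standing in for real network outputs:
e1 = np.array([1.0, 0.1, 0.0, 0.2])
e2 = np.array([0.9, 0.2, 0.1, 0.2])    # similar direction -> small distance
e3 = np.array([-0.5, 1.0, 0.3, -0.1])  # different direction -> large distance
print(same_identity(e1, e2))  # True
print(same_identity(e1, e3))  # False
```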

Training data

We must carefully choose our training data such that we observe wide variability in images of each identity. The embedding network is therefore forced to learn to ignore image variation not related to identity and to focus on extracting features useful for characterising identity. Such datasets are often obtained by crawling the internet.

For example, if you type a celebrity’s name into an image search, you will often find hundreds of images of that person.

Google images of Tom Hanks

This enables datasets of tens of millions of images and hundreds of thousands of identities to be obtained automatically. Of course, we must be careful to clean such datasets to remove erroneously labelled images.

For example, the images returned by the search may include other people or even lookalikes of the original celebrity. Data cleansing is an open research problem in itself; it is currently tackled using a combination of huge amounts of manual labour and some semi-automatic tricks.

For example, we can train an initial version of the network, use this to label all of a validation set and then have a human manually check any images where the network makes a mistake.
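That semi-automatic trick can be sketched in a few lines. Everything here is a stand-in for illustration: `model_predict`, the toy brightness "model", and the (image, label) pairs are assumed interfaces, not a real pipeline.

```python
def flag_for_review(model_predict, validation_set):
    """Run an initial model over a labelled validation set and collect every
    image whose prediction disagrees with its label, so a human can check
    whether the label or the model is at fault.
    `model_predict` and the (image, label) pairs are stand-in interfaces."""
    flagged = []
    for image, label in validation_set:
        prediction = model_predict(image)
        if prediction != label:
            flagged.append((image, label, prediction))
    return flagged

# Toy example: a stand-in "model" that classifies by image brightness.
fake_model = lambda image: "A" if image > 0.5 else "B"
data = [(0.9, "A"), (0.2, "B"), (0.8, "B")]  # the last label disagrees
print(flag_for_review(fake_model, data))  # [(0.8, 'B', 'A')]
```

Only the flagged disagreements go to a human, which is far cheaper than checking every image by hand.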

Triplet selection

A triplet loss is used to learn a good embedding. However, there are a vast number of possible triplets (for every pair of images of one person, you can create a triplet using every image of every other person in the training set). Many of these will be easy, i.e. the embedding already separates the different identities well.
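For reference, the triplet loss itself [2] can be written in a few lines. The 2-D points and the 0.2 margin below are illustrative only; real systems use high-dimensional embeddings and tune the margin.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: pulls same-identity embeddings together and
    pushes different-identity embeddings at least `margin` further apart.
    The margin value is illustrative."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# An "easy" triplet (negative already far away) contributes zero loss,
# which is exactly why easy triplets teach the network nothing:
a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])
n = np.array([3.0, 0.0])
print(triplet_loss(a, p, n))  # 0.0
```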

The network does not learn anything useful from these triplets. For this reason, it is good to mine hard triplet examples. This involves finding pairs of the same identity that embed far apart and pairs of different identities that embed close together.

Hard triplets are formed from these pairs to force the network to focus on improving the embeddings for the most challenging images.
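One common way to do this is "batch-hard" mining within each training batch, sketched below with plain Euclidean distances on toy 2-D embeddings. In practice the same logic runs on the learned high-dimensional embeddings, and is vectorised for speed.

```python
import numpy as np

def mine_hard_triplets(embeddings, labels):
    """For each anchor, pick the hardest positive (same identity, embedded
    furthest away) and the hardest negative (different identity, embedded
    closest).  A simple sketch of batch-hard triplet mining."""
    n = len(labels)
    # Pairwise Euclidean distances between all embeddings in the batch.
    d = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    triplets = []
    for a in range(n):
        same = [(d[a, j], j) for j in range(n) if labels[j] == labels[a] and j != a]
        diff = [(d[a, j], j) for j in range(n) if labels[j] != labels[a]]
        if same and diff:
            _, pos = max(same)   # hardest positive: furthest same-identity image
            _, neg = min(diff)   # hardest negative: closest other-identity image
            triplets.append((a, pos, neg))
    return triplets

# Toy batch: two images each of identities "x" and "y" on a line.
embs = np.array([[0.0, 0.0], [1.0, 0.0], [0.6, 0.0], [5.0, 0.0]])
labels = ["x", "x", "y", "y"]
print(mine_hard_triplets(embs, labels))  # [(0, 1, 2), (1, 0, 2), (2, 3, 1), (3, 2, 1)]
```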

Face detection

Often, a face is only a small part of an image. If we give the whole image to the embedding network, the network must learn to ignore almost all of the image. In addition, since networks are practically limited to taking relatively small images as input, the face itself will end up tiny, compressed to only a few pixels and unrecognisable at that resolution.

For this reason, face embedding networks usually assume that the input images have already been cropped to a bounding box around the face.

The entirely separate task of finding the approximate location of every face in an image is called face detection. Different architectures are normally used for this.

For example, we might train the network to regress bounding box coordinates from an image. Or we might have it output a heatmap where every pixel represents the probability that a face is present at that location and then fit bounding boxes to the blobs with the highest probability.
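The heatmap variant can be illustrated with a deliberately simplified sketch that assumes at most one face: threshold the map and take the extent of the above-threshold pixels. A real detector would first separate multiple blobs into connected components and fit one box per blob.

```python
import numpy as np

def box_from_heatmap(heatmap, threshold=0.5):
    """Fit a bounding box to the high-probability region of a face heatmap.
    Simplified sketch assuming a single face; real detectors separate
    multiple blobs first.  The 0.5 threshold is an illustrative choice."""
    ys, xs = np.nonzero(heatmap > threshold)
    if len(ys) == 0:
        return None  # no face detected
    # (left, top, right, bottom) in pixel coordinates.
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

# Toy 5x5 heatmap with a "face" blob towards the lower right:
h = np.zeros((5, 5))
h[2:4, 3:5] = 0.9
print(box_from_heatmap(h))  # (3, 2, 4, 3)
```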

Landmarking and alignment

Finally, there is a question about whether it is worth aligning the faces more precisely before passing them to the embedding network.

For example, you might choose to rotate, scale and translate the image so that the eye centres are always in the same location. This should make the task of recognition easier since the same feature will end up in roughly the same location, removing one unhelpful source of variation.
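This rotate-scale-translate operation is a similarity transform, and it can be computed in closed form from the two eye centres alone. The canonical target positions below are illustrative choices for a hypothetical 100x100 crop, not values from any particular system.

```python
import numpy as np

def eye_alignment(left_eye, right_eye,
                  target_left=(30.0, 50.0), target_right=(70.0, 50.0)):
    """Similarity transform (rotation, uniform scale, translation) mapping the
    detected eye centres onto fixed canonical positions.  The target
    coordinates are illustrative choices for a 100x100 crop."""
    src = np.array(right_eye) - np.array(left_eye)
    dst = np.array(target_right) - np.array(target_left)
    scale = np.linalg.norm(dst) / np.linalg.norm(src)
    angle = np.arctan2(dst[1], dst[0]) - np.arctan2(src[1], src[0])
    c, s = np.cos(angle) * scale, np.sin(angle) * scale
    R = np.array([[c, -s], [s, c]])                      # rotation + scale
    t = np.array(target_left) - R @ np.array(left_eye)   # translation
    return R, t

# Eyes detected at an angle; after the transform they land on the targets:
R, t = eye_alignment(left_eye=(20.0, 40.0), right_eye=(60.0, 60.0))
print(np.round(R @ np.array([20.0, 40.0]) + t, 6))  # [30. 50.]
print(np.round(R @ np.array([60.0, 60.0]) + t, 6))  # [70. 50.]
```

Applying `R` and `t` to every pixel coordinate (in practice via an image-warping routine) produces the aligned crop.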

To do this, a set of landmark locations is usually estimated on the bounding-box-cropped images. This is done by another deep CNN, and again a variety of different architectures can be used.

Most commonly, either the 2D coordinates of each landmark are directly predicted with a regression network or the network outputs probability heat maps for each landmark.
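Decoding the heatmap output can be sketched by simply taking the peak of each map. This is a toy illustration; real systems often use a soft-argmax instead to obtain sub-pixel accuracy.

```python
import numpy as np

def landmarks_from_heatmaps(heatmaps):
    """Read off one (x, y) landmark per heatmap by taking the location of the
    maximum probability.  Sketch of the heatmap-output approach; a soft-argmax
    would give sub-pixel accuracy."""
    coords = []
    for hm in heatmaps:
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        coords.append((int(x), int(y)))
    return coords

# One toy 4x4 heatmap with its peak at column 2, row 1:
hm = np.zeros((4, 4))
hm[1, 2] = 1.0
print(landmarks_from_heatmaps([hm]))  # [(2, 1)]
```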

Face feature identification

The breakthrough paper in this area [1] used a very elaborate alignment procedure involving the estimation of 67 landmarks, fitting a 3D model to these landmarks and then warping the image to a frontal pose using the 3D model.

Sylvester Stallone 3D modelling process example

However, for the same reasons we criticised handcrafting of features earlier in the course, handcrafting an alignment pipeline is not necessarily optimal. Given enough training data, any required alignment can be learnt as part of the overall face embedding process.


References

  1. Taigman, Yaniv, et al. “Deepface: Closing the gap to human-level performance in face verification.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
  2. Schroff, Florian, Dmitry Kalenichenko, and James Philbin. “Facenet: A unified embedding for face recognition and clustering.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.