
How do GANs generate faces?


Here, we look at how the concept of an adversary can be turned to our advantage. We are going to focus on the task of generating random, photorealistic images of faces. An approach proposed in 2014 has proven extremely successful for this task. This approach is called a Generative Adversarial Network (GAN) and consists of two networks. The generator transforms a random vector into an image. The discriminator tries to differentiate between real and fake images. The two networks are trained in parallel, playing a kind of game: the generator is trying to fool the discriminator while the discriminator is trying to spot the fakes from the generator.

The generator

The first network is the one that is going to synthesise the images. The architecture looks a little like a backwards version of the CNNs you’ve seen so far. It takes a vector, \(z\), as input at the first layer and outputs an image, \(i=g(z)\), from the final layer. Such an architecture could be built from layers you’ve already seen. For example, fully connected layers could map from the input vector to a tensor with low spatial resolution. Then a series of convolutional layers could process tensors of increasing spatial resolution until the final layer. The only bit we haven’t seen is some kind of opposite to pooling that, instead of decreasing resolution, increases it. For this, we could use some kind of upsampling, like that used when you resize a photo in an image editing tool. The simplest is nearest neighbour upsampling, where each pixel is simply replicated four times to form a 2×2 block, doubling the width and height:

Nearest neighbour upsampling example
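To make this concrete, here is a minimal sketch of nearest neighbour upsampling in PyTorch (our choice of framework for these examples; the article itself doesn’t prescribe one). A 2×2 input becomes a 4×4 output, with each pixel copied into a 2×2 block:

import torch
import torch.nn.functional as F

# A 2x2 single-channel image, shaped (batch, channels, height, width).
x = torch.tensor([[1., 2.],
                  [3., 4.]]).reshape(1, 1, 2, 2)

# Nearest neighbour upsampling: each pixel is replicated into a 2x2 block.
y = F.interpolate(x, scale_factor=2, mode="nearest")
print(y.squeeze())
# tensor([[1., 1., 2., 2.],
#         [1., 1., 2., 2.],
#         [3., 3., 4., 4.],
#         [3., 3., 4., 4.]])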

In reality, the architecture of the generator is very important and lots of subtle tricks have been developed for good performance. But a basic combination of fully connected layers, upsampling, convolution and ReLU would work reasonably well.
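As an illustration, a generator built only from those ingredients might look like the following sketch. The sizes here (a 128-dimensional latent vector and 64×64 greyscale output images) are illustrative assumptions, not a recommended design:

import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # A fully connected layer maps the latent vector to a tensor
        # with low spatial resolution (8x8) and many channels.
        self.fc = nn.Linear(latent_dim, 256 * 8 * 8)
        # Alternating upsampling and convolution increases the spatial
        # resolution step by step until the final image size is reached.
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),  # 8x8 -> 16x16
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="nearest"),  # 16x16 -> 32x32
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="nearest"),  # 32x32 -> 64x64
            nn.Conv2d(64, 1, 3, padding=1), nn.Tanh(),    # pixels in [-1, 1]
        )

    def forward(self, z):
        x = self.fc(z).reshape(-1, 256, 8, 8)
        return self.net(x)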

We say that the vector input to the generator is a point in the latent space of the model. This latent space is a compressed representation of the data (in our case face images) in which similar images have similar latent vectors (i.e. are close together in the latent space). This is a bit like the embedding that our face recognition network computed earlier in the course. In a GAN, we decide the distribution that points in the latent space follow. Most commonly, we say that they follow a multivariate normal distribution, a concept you might have come across in your mathematical studies.

When training our GAN, we randomly generate latent vectors by sampling from this distribution. All this means is that we generate random numbers that follow the distribution. We feed the random latent vector into the generator and it outputs an image. Once training is finished, we expect this image to be indistinguishable from a photograph of a face.
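In code, sampling a batch of latent vectors and generating images from them might look like this sketch (reusing the illustrative Generator above):

import torch

g = Generator(latent_dim=128)

# Each entry of z is drawn independently from N(0, 1), so each row is a
# sample from a 128-dimensional standard multivariate normal distribution.
z = torch.randn(16, 128)
fake_images = g(z)  # shape: (16, 1, 64, 64)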

The discriminator

The discriminator is a classification CNN exactly like the ones you’ve seen before. It takes an image, \(i\), as input and outputs \(d(i)\), the probability that the image is a real face image. During training, the discriminator is given real face images half the time and output from our generator the other half. This means we need a large dataset of real face images, similar to the training data for the face recognition network.
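Matched to the illustrative 64×64 greyscale images above, a minimal discriminator sketch could look like this (the sizes are again assumptions, and strided convolutions are used here for brevity where pooling would work equally well):

import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # Strided convolutions halve the resolution at each step.
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),    # 64x64 -> 32x32
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),  # 32x32 -> 16x16
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(), # 16x16 -> 8x8
            nn.Flatten(),
            nn.Linear(256 * 8 * 8, 1),
            nn.Sigmoid(),  # d(i): probability that the image is real
        )

    def forward(self, i):
        return self.net(i)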

GAN training

The training process uses a very slight twist on the idea of backpropagation. The discriminator is trained as normal. It is given either a real or synthesised fake image as input and a classification loss is computed. This discriminator loss is used to update the weights in the discriminator network (i.e. to help the discriminator get better at detecting fakes).

Discriminator training
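As a sketch, one discriminator update might look like this, assuming the illustrative generator g and discriminator d above and an optimiser d_opt (e.g. torch.optim.Adam over d’s parameters):

import torch
import torch.nn.functional as F

def discriminator_step(real, g, d, d_opt, latent_dim=128):
    # Synthesise a batch of fakes; detach so this update does not
    # backpropagate into the generator.
    z = torch.randn(real.size(0), latent_dim)
    fake = g(z).detach()

    # Real images are labelled 1, fake images are labelled 0.
    loss = F.binary_cross_entropy(d(real), torch.ones(real.size(0), 1)) \
         + F.binary_cross_entropy(d(fake), torch.zeros(real.size(0), 1))

    d_opt.zero_grad()
    loss.backward()
    d_opt.step()  # only the discriminator's weights are updated here
    return loss.item()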

However, when a fake image is provided as input, we also compute a second classification loss with flipped labels – i.e. we want the discriminator to misclassify the fake image as real. The gradient of this loss is backpropagated through the discriminator (without updating the discriminator’s weights) and then into the generator (where the weights are updated). These updates change the pixel values the generator outputs, moving its fake images closer to being classified as real and so making the discriminator perform worse.

Generator training
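The matching generator update, sketched below, uses the flipped labels: fake images are labelled 1 (“real”). Because only the generator’s optimiser g_opt takes a step, the gradient flows through the discriminator without changing its weights:

import torch
import torch.nn.functional as F

def generator_step(g, d, g_opt, batch_size=16, latent_dim=128):
    z = torch.randn(batch_size, latent_dim)
    fake = g(z)  # no detach: gradients must reach the generator

    # Flipped label: we want the discriminator to call these fakes real.
    loss = F.binary_cross_entropy(d(fake), torch.ones(batch_size, 1))

    g_opt.zero_grad()
    loss.backward()   # gradients pass through d into g...
    g_opt.step()      # ...but only g's weights are updated
    return loss.item()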

This process continues until an equilibrium is found in which the discriminator assigns both real and fake images a probability of 0.5 – i.e. it cannot tell the difference between real and fake images. At this point, the generator has hopefully learnt the true distribution of the real data. In our case, this means it can generate the whole space of possible face images.
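Putting the pieces together, training simply alternates the two updates. This loop is a sketch: dataloader (a source of batches of real face images) and num_epochs are assumed to be defined elsewhere:

import torch

g, d = Generator(), Discriminator()
g_opt = torch.optim.Adam(g.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(d.parameters(), lr=2e-4)

for epoch in range(num_epochs):      # num_epochs: assumed hyperparameter
    for real in dataloader:          # dataloader: assumed batches of real images
        discriminator_step(real, g, d, d_opt)
        generator_step(g, d, g_opt, batch_size=real.size(0))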

© University of York
This article is from the free online course Intelligent Systems: An Introduction to Deep Learning and Autonomous Systems.
