
Deep fakes

GANs can synthesise not just random identities but the faces of specific people too. Dr Will Smith explains how.
We’ve seen how a GAN enables the generation of random, photorealistic, still face images. But what if we want to synthesise not just random identities, but the face of a specific person? And what if, rather than just a single still image, we want to synthesise whole videos of that face? Perhaps while talking, pulling particular expressions and changing pose in a desired way. This is often done in movies, when digital doubles replace real actors in computer-generated scenes. However, this requires hundreds of hours of work from talented visual effects artists and the use of expensive equipment. We’ll now see how results of similar quality can be achieved from just a single video, using ideas you’ve already seen in deep learning and GANs.
These are now often known as deep fake videos. First, we need a video of the person we would like to appear in the fake video. Second, we need a way to control the pose and expression of the target face. The easiest way to do this is to copy it from a source video. So, we provide a second video of a different person performing the speech and expressions that we would like to map into the fake video. Next, we use a computer vision technique to estimate the 3D shape and texture of the face in each frame of the source and target videos.
We don’t have time to cover how this is done here, but suffice to say it can also be done using deep learning. Now, we can copy the pose and expression from the source to the target face and create an image for each frame using basic computer graphics. However, this won’t look realistic and won’t include things like hair or the background scene. Now we get to the clever part. We train something called a conditional GAN. Rather than generating random faces, this network generates images that depend on, or we say are conditional upon, some input. In this case, we make the deep fake output image conditional on the unrealistic computer graphics rendering image.
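To make the idea of conditioning concrete, here is a minimal PyTorch sketch of a conditional GAN (a hypothetical toy architecture, not the networks from the actual deep fake system). The generator maps the coarse computer-graphics rendering to a refined output frame, and the discriminator scores the (rendering, frame) pair, so realism is judged conditional on the input rendering.

```python
import torch
import torch.nn as nn

# Hypothetical toy conditional GAN: real systems use much deeper networks,
# but the conditioning pattern is the same.

class Generator(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, channels, 3, padding=1), nn.Tanh(),
        )

    def forward(self, rendering):
        # Output depends on (is conditional upon) the rendering, not on noise.
        return self.net(rendering)

class Discriminator(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        # Input is the rendering concatenated with a real or fake frame,
        # so the discriminator judges the pair, not the frame alone.
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, 16, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, rendering, frame):
        return self.net(torch.cat([rendering, frame], dim=1))

G, D = Generator(), Discriminator()
rendering = torch.rand(1, 3, 16, 16)   # stand-in for a coarse CG render
fake_frame = G(rendering)              # refined, "photorealistic" output
score = D(rendering, fake_frame)       # realism score, conditional on input
```

Feeding the rendering to both networks is what distinguishes this from the unconditional GAN seen earlier, where the generator's input was random noise.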
As for a GAN, we train this network adversarially so that it learns to produce photorealistic images.
The final question is: how do we train this conditional GAN? The trick is that we train it only on the video of the target person. In other words, we intentionally overfit to that single identity, background, hairstyle, clothing and so on. And for training data, we use the computer graphics images we got from the original video. So, we convert original video frames into 3D models, render these to computer graphics images and then train the conditional GAN to reproduce the original video frame from this computer graphics image. Putting this all together, we can now control the computer graphics model from one video and use this to drive the GAN in the output video, giving us highly convincing deep fake videos.
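The training procedure described above can be sketched as a single adversarial update step. This is a toy illustration using tiny networks and an assumed L1 reconstruction loss alongside the adversarial loss (the real system uses larger networks and more elaborate losses); the key point is that every (rendering, frame) training pair comes from the one target video, so the network deliberately overfits to that person.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy networks standing in for the conditional GAN (hypothetical sizes).
G = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1), nn.Tanh())
D = nn.Sequential(nn.Conv2d(6, 1, 3, padding=1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

# One training pair from the target video: the CG render of a frame,
# and the original frame it was rendered from.
rendering = torch.rand(1, 3, 16, 16)
real_frame = torch.rand(1, 3, 16, 16)

# Discriminator step: push real (rendering, frame) pairs towards 1
# and fake pairs towards 0.
fake = G(rendering).detach()
d_real = D(torch.cat([rendering, real_frame], dim=1))
d_fake = D(torch.cat([rendering, fake], dim=1))
loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
          + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: fool the discriminator, plus an assumed L1 loss
# encouraging the output to reproduce the original frame.
fake = G(rendering)
d_fake = D(torch.cat([rendering, fake], dim=1))
loss_g = (F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
          + 10.0 * F.l1_loss(fake, real_frame))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

At inference time, the rendering would instead come from the source actor's pose and expression mapped onto the target's 3D model, which is what lets one video drive the other.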



Kim, Hyeongwoo, et al. “Deep video portraits.” ACM Transactions on Graphics (TOG) 37.4 (2018): 1-14.

This article is from the free online

Intelligent Systems: An Introduction to Deep Learning and Autonomous Systems

Created by
FutureLearn - Learning For Life
