
What other systems could you design?

[Image: a glowing brain floating above a handheld device]

We have now seen how IML is implemented in a practical sense. However, this is only one context in which people are creating art with AI. Here we will talk about specific deep learning technologies that are responsible for some really amazing creative work.


Pix2Pix

The Pix2Pix network is the short, catchy name for image-to-image translation using a conditional adversarial network. The paper that introduced this type of neural network was written by Phillip Isola et al. in 2016. Since then, many AI artists have used the network in their creative work.

The basic idea is that Pix2Pix takes pairs of related images and learns how to convert from one image to the other. The network can be trained to associate any type of image with another, which makes it very flexible and allows artists to create interesting works such as Gene Kogan’s Invisible Cities.
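The key to training on paired images is the loss the generator tries to minimise: the Pix2Pix paper combines an adversarial term (fool the discriminator) with an L1 term (stay close to the paired target image, pixel by pixel). The sketch below is a toy numpy illustration of that combined loss, not the real network code; the function name and the dummy 4×4 "images" are ours, while the λ = 100 weighting follows the paper.

```python
import numpy as np

# Weighting between the adversarial and L1 terms, as reported in the paper.
LAMBDA = 100.0

def pix2pix_generator_loss(disc_score_on_fake, generated, target):
    """Toy combined loss for the generator on one (input, target) image pair.

    disc_score_on_fake -- discriminator's probability that the generated
                          image is real (the generator wants this near 1)
    generated, target  -- image arrays of the same shape
    """
    adversarial = -np.log(disc_score_on_fake + 1e-12)  # cross-entropy term
    l1 = np.mean(np.abs(generated - target))           # pixel-wise closeness
    return adversarial + LAMBDA * l1

# A perfect translation (output equals the paired target and the
# discriminator is fooled) gives a loss of roughly zero.
target = np.ones((4, 4))
print(pix2pix_generator_loss(disc_score_on_fake=1.0,
                             generated=target.copy(),
                             target=target))
```

The L1 term is what ties the output to its *paired* target; without it, the generator would only have to produce something plausible, not something that matches the specific partner image.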


Vid2Vid

The Vid2Vid neural network was developed by taking the Pix2Pix technology and building on it to work for video. Just as Pix2Pix takes pairs of images, Vid2Vid works with pairs of video clips. The network was introduced by Wang et al. in 2018. It differs from Pix2Pix in that it can learn temporal information between the frames of a video: that is, how the content of the frames changes over time. Just as Pix2Pix can create images of building facades from pre-labelled mock-ups, Vid2Vid can take a mock-up video and convert it into photorealistic footage.


Deepfakes

We heard about deepfakes earlier, in Step 2.2, and about their uncanny ability to create footage of events that never really happened. But what is the technology that makes them work? Deepfakes are built on a specific type of deep neural network called an autoencoder. Terence Broad’s Blade Runner: Autoencoded, which you read about earlier in the week, uses this same type of network.

An autoencoder can analyse an image, extract information from it, save that information in a highly compressed state and then re-create the image. Alan Zucconi’s article, linked below, includes a nice diagram of this process. In separate phases, the network encodes and then decodes the image. It turns out that if you train the network on two different faces, you can superimpose the features of one face onto the other in a believable manner.
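The encode/compress/decode idea can be sketched in a few lines. Real autoencoders learn their encoder and decoder with gradient descent; the toy below instead uses SVD to find the optimal *linear* encoder, squeezing fake 16-pixel "images" down to a 2-number code that the decoder expands back. The dataset here is made up so that a 2-dimensional code really can capture it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake dataset: 100 flattened 4x4 "images" that secretly vary along only
# two underlying directions, so a 2-number code can describe each one.
basis = rng.normal(size=(2, 16))
codes_true = rng.normal(size=(100, 2))
images = codes_true @ basis

# "Train": SVD finds the best 2-D subspace, which is exactly the optimum
# a linear autoencoder would converge to.
_, _, vt = np.linalg.svd(images, full_matrices=False)
encoder = vt[:2].T        # 16 numbers -> 2 numbers : compress
decoder = vt[:2]          # 2 numbers -> 16 numbers : re-create

code = images @ encoder           # each image stored as just 2 numbers
reconstruction = code @ decoder   # the re-created 16-pixel image

# Reconstruction error is essentially zero for this toy data.
print(np.max(np.abs(images - reconstruction)))
```

A deepfake pipeline builds on this by training on two faces, so that the compressed code captures expression and pose, which can then be decoded as the *other* person's face.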

Sound and Music Generators

Deep neural networks are not only valuable for visual content. Over the past few years, researchers have developed several networks that can produce music, both as notation and as raw audio. The open-source research project Magenta has recently helped bands such as The Flaming Lips to come up with songs and has contributed to their live performances. They used the Piano Genie model to power a keyboard made of fruit, the Fruit Genie. This machine learning model is built on a long short-term memory (LSTM) network, which remembers what it has done in the past to help predict its output. The model makes automatic mappings between an eight-button controller played by the musician and the 88 keys of a standard piano.
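To see why memory matters for an 8-button-to-88-key mapping, consider that each button press can only make sense *relative to what was just played*. The sketch below is a deliberately crude stand-in for Piano Genie (the real model uses a learned LSTM): it keeps only the last note as "memory" and moves up or down the keyboard depending on the button. The function name, step rule and starting key are our own invention for illustration.

```python
import numpy as np

N_KEYS, N_BUTTONS = 88, 8

def play(buttons, start_key=44):
    """Toy mapping from a sequence of button presses (0-7) to keys (0-87).

    The last key played acts as one-note 'memory': buttons 0-3 step down
    from it, buttons 4-7 step up, and the outer buttons take bigger leaps.
    """
    key, keys = start_key, []
    for b in buttons:
        step = (b - 3.5) * 3                      # ranges over -10.5 .. +10.5
        key = int(np.clip(key + step, 0, N_KEYS - 1))
        keys.append(key)
    return keys

print(play([7, 7, 0, 4]))
```

Even this crude rule shows the core idea: the same button produces different pitches depending on context, which is what lets eight buttons cover an 88-key range expressively.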

WaveNet, from DeepMind, is a type of convolutional neural network (CNN) that can produce raw audio and has been used to successfully synthesise the human voice. However, it is not limited to vocal sounds and can synthesise any type of audio. CNNs had previously been used mainly for processing two-dimensional images; after the architecture was adapted to run on one-dimensional data, the network could take raw audio as input.
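The one-dimensional adaptation means sliding a filter along the audio samples instead of over image pixels. WaveNet specifically stacks *dilated causal* convolutions: each output depends only on past samples, and the gap between filter taps doubles at each layer so the network hears further back in time. Below is a minimal numpy sketch of one such convolution (the function and toy signal are ours, not DeepMind's code).

```python
import numpy as np

def causal_dilated_conv(audio, kernel, dilation):
    """1-D causal convolution: output[t] uses samples at t, t-d, t-2d, ..."""
    audio = np.asarray(audio, dtype=float)
    out = np.zeros_like(audio)
    for t in range(len(audio)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:          # causal: never look into the future;
                acc += w * audio[idx]  # missing past samples count as silence
        out[t] = acc
    return out

audio = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
layer1 = causal_dilated_conv(audio, kernel=[0.5, 0.5], dilation=1)
layer2 = causal_dilated_conv(layer1, kernel=[0.5, 0.5], dilation=2)
print(layer2)  # each value now summarises up to 4 past samples
```

Stacking layers with dilations 1, 2, 4, 8, ... is how WaveNet covers the thousands of past samples needed to model raw audio at sample rate.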

MuseNet was developed by OpenAI in 2019. It is powered by a deep neural network that can produce four-minute songs with up to ten instruments, and it can blend different styles of music: for example, it could create a piece that mixes Chopin with Bon Jovi. The network MuseNet is built on is called a sparse transformer, which specialises in predicting what comes next in a sequence.
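"Predicting what comes next in a sequence" is easier to grasp with a tiny example. The sketch below replaces the sparse transformer with the simplest possible next-note predictor, a bigram count model, trained on a made-up melody; the function names and melody are our own, and the point is only the framing of music generation as next-token prediction.

```python
from collections import Counter, defaultdict

def train_bigram(sequence):
    """Count, for each note, which notes follow it in the training melody."""
    following = defaultdict(Counter)
    for a, b in zip(sequence, sequence[1:]):
        following[a][b] += 1
    return following

def predict_next(model, note):
    """Predict the most frequent follower of the given note."""
    return model[note].most_common(1)[0][0]

melody = ["C", "E", "G", "C", "E", "G", "C", "D", "E"]
model = train_bigram(melody)
print(predict_next(model, "C"))  # "E" follows "C" most often here
```

A transformer does the same job with vastly more context than one previous note, which is what lets MuseNet sustain style and structure over a four-minute piece.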

Have your say

Which application of deep learning in the video caught your attention and inspired your imagination?
Share your answers in the Comments section.
This article is from the free online course Introduction to Creative AI, created by FutureLearn.
