# What is an adversarial example?

An adversarial example is a network input that has been specifically designed so that the network makes a mistake.

Here we’re going to look at a weakness of neural networks – a way that they can be ‘attacked’ by an adversary in order to produce nonsensical output.

Adversarial examples are theoretically interesting because they might tell us something about what the network has learned and how it makes decisions. On the other hand, they are also practically worrying.

Consider autonomous systems such as self-driving cars. What if a malicious person could attach an image to the back of their car that caused the vision systems of other self-driving cars to produce nonsense output? This may seem far-fetched, but it has already been shown that adversarial patches can be printed out and placed in a scene so that a CNN observing them is forced to make a mistake.

So, how do you go about creating an adversarial example? It turns out to be very easy and only needs techniques that you already know about.

Take any input image, $$x$$, that is correctly classified by the network. Suppose that the network was trained with a loss function $$E(f(x),y_{\text{correct}})$$ where $$f$$ applies the network and $$y_{\text{correct}}$$ is the correct label for $$x$$. We now wish to find a perturbation $$r$$ that we can add to $$x$$, i.e. a slight modification of the image, so that the network incorrectly classifies $$x+r$$ as class $$y_{\text{incorrect}}$$. We can do this by solving an optimisation problem:

Find the $$r$$ that minimises: $$E(f(x+r),y_{\text{incorrect}})+c\|r\|$$

As usual, we solve the optimisation using gradient descent, but this time we backpropagate all the way through the network to the input image and hence to $$r$$. The network weights stay fixed; only $$r$$ is updated.
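Written out, one gradient-descent step updates only the perturbation. The step size $$\eta$$ is a symbol introduced here for illustration; it does not appear in the formulation above.

$$r \leftarrow r - \eta \, \nabla_r \left[ E(f(x+r),y_{\text{incorrect}}) + c\|r\| \right]$$

Because $$r$$ enters the network only through the sum $$x+r$$, the gradient of the first term with respect to $$r$$ is the same as the gradient of the loss with respect to the input image, evaluated at $$x+r$$. That is exactly the quantity that backpropagation to the input provides.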

The function we’re minimising has two parts added together. The first part says: make $$x+r$$ be classified as $$y_{\text{incorrect}}$$ by getting the classification loss as low as possible. The second part says: make $$r$$ small so that it only modifies the image slightly. The value $$c$$ is a parameter that allows us to trade off the two objectives (if we make it small, the optimisation will focus on changing the classification but might make the perturbation quite large; if we make it large, the perturbation stays small but might not change the classification).
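The optimisation can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the setup behind any results discussed here: the "network" is a hypothetical fixed linear softmax classifier with made-up weights `W` and `B`, the loss $$E$$ is taken to be cross-entropy, and the penalty uses the squared norm $$c\|r\|^2$$ rather than $$c\|r\|$$ so that its gradient is defined at $$r=0$$. The names `targeted_attack`, `step` and `iters` are likewise illustrative.

```python
import math

# Hypothetical toy "network": a fixed linear softmax classifier with
# 3 classes and 4 input features. Its weights are never updated.
W = [[1.0, -0.5, 0.2, 0.0],
     [-0.3, 0.8, -0.1, 0.4],
     [0.1, 0.1, 0.9, -0.6]]
B = [0.0, 0.1, -0.1]


def softmax_probs(x):
    """Class probabilities f(x) of the toy classifier."""
    logits = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W, B)]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x):
    """Index of the most probable class."""
    probs = softmax_probs(x)
    return probs.index(max(probs))


def objective_grad(x, r, target, c):
    """Gradient with respect to r of E(f(x + r), target) + c * ||r||^2."""
    probs = softmax_probs([a + b for a, b in zip(x, r)])
    # For softmax cross-entropy, dE/dlogit_k = p_k - [k == target].
    dlogits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    # Backpropagate through the linear layer to the input,
    # then add the gradient of the squared-norm penalty.
    return [sum(dlogits[k] * W[k][j] for k in range(len(W))) + 2.0 * c * r[j]
            for j in range(len(x))]


def targeted_attack(x, target, c=0.2, step=0.1, iters=500):
    """Plain gradient descent on r; the classifier itself stays fixed."""
    r = [0.0] * len(x)
    for _ in range(iters):
        grad = objective_grad(x, r, target, c)
        r = [ri - step * gi for ri, gi in zip(r, grad)]
    return r


x = [1.0, 0.2, 0.1, 0.0]  # classified as class 0 by the toy model
r = targeted_attack(x, target=2)
x_adv = [a + b for a, b in zip(x, r)]
print("original class:", predict(x))
print("adversarial class:", predict(x_adv))
print("perturbation norm:", math.sqrt(sum(v * v for v in r)))
```

Raising `c` in this sketch shrinks the perturbation but can stop the attack from reaching the target class, which is the trade-off described above. With a real network the hand-written gradient would be replaced by automatic differentiation with respect to the input.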

## Example results

Now we get to the worrying observation. It turns out to be very easy to find tiny modifications that completely change the classification result! In fact, the perturbations are so small you can’t even see them! In the following example, the images in the first column were all correctly classified. In the second column we see $$r$$, the perturbation, magnified so that it’s visible. In the third column we see $$x+r$$, the perturbed image – now an adversarial example. All three images in the third column were classified as … wait for it … ostrich!

You will probably agree that it’s hard to see any difference between the images in the third column and those in the first column. Also, they look nothing like an ostrich!

The reason for this behaviour is still an open research question and the subject of a lot of debate.

## Other attacks

The approach above modified the whole image by a small amount. But other research has shown that it is possible to change the classification by modifying only a single pixel, or by introducing a small local image patch that can be printed and placed in a real scene!
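To make the single-pixel idea concrete, here is a minimal sketch that simply tries every pixel and a handful of replacement values until the predicted class changes. It is an exhaustive search on a hypothetical toy linear classifier (made-up weights `W` and `B`, a four-"pixel" image with values in [0, 1]). The published one-pixel attack uses differential evolution to search real images instead, so treat this only as an illustration of the goal, not of that method.

```python
# Hypothetical toy classifier: 3 classes, 4 input "pixels" in [0, 1].
W = [[1.0, -0.5, 0.2, 0.0],
     [-0.3, 0.8, -0.1, 0.4],
     [0.1, 0.1, 0.9, -0.6]]
B = [0.0, 0.1, -0.1]


def predict(x):
    """Index of the class with the largest logit."""
    logits = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W, B)]
    return logits.index(max(logits))


def single_pixel_attack(x, candidate_values=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return (pixel index, new value, new class) for the first single-pixel
    change that alters the prediction, or None if no candidate works."""
    original = predict(x)
    for i in range(len(x)):
        for value in candidate_values:
            x_mod = list(x)
            x_mod[i] = value  # change exactly one pixel
            new_class = predict(x_mod)
            if new_class != original:
                return i, value, new_class
    return None


x = [1.0, 0.2, 0.1, 0.0]
result = single_pixel_attack(x)
print("original class:", predict(x))
print("successful single-pixel change:", result)
```

Exhaustive search is only feasible here because the toy image has four pixels; for real images the search space is far too large, which is why a smarter search strategy is needed.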

## Defences

The existence of adversarial examples is worrying. It suggests that our networks are very fragile. Perhaps when a network has to operate in the real world and sees lots of new data, it won’t generalise and will make mistakes in a similar way to adversarial examples? Adversarial examples also leave us open to malicious attacks. For this reason, there has been a lot of work on adversarial robustness – effectively trying to create networks for which adversarial examples cannot be constructed.

One interesting idea is to treat every image as a small volume instead of a point in the high dimensional input space. Instead of training so that only that point is correctly classified, we design a loss so that any point a small distance from the training sample is also classified correctly. Despite this progress, adversarial attacks remain a major concern for deep learning and for making sure that systems built on it are safe.
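The "small volume" idea can be made concrete for a very simple model. The sketch below assumes a hypothetical linear classifier (made-up weights `W` and `B`) and takes the volume to be a box in which every input value may change by at most `eps`. For a linear model the worst-case margin over that box has a closed form, so we can check exactly whether every point in the box keeps the same label. The hinge-style `robust_hinge_loss` is an illustrative choice rather than the loss from any particular paper, and deep networks need bounds on the worst case that are not shown here.

```python
# Hypothetical toy linear classifier: 3 classes, 4 input features.
W = [[1.0, -0.5, 0.2, 0.0],
     [-0.3, 0.8, -0.1, 0.4],
     [0.1, 0.1, 0.9, -0.6]]
B = [0.0, 0.1, -0.1]


def worst_case_margins(x, label, eps):
    """For each other class k, the smallest value of logit[label] - logit[k]
    over all perturbations r with max_i |r_i| <= eps."""
    margins = {}
    for k in range(len(W)):
        if k == label:
            continue
        diff = [a - b for a, b in zip(W[label], W[k])]
        nominal = sum(d * v for d, v in zip(diff, x)) + B[label] - B[k]
        # The worst perturbation moves every feature by eps against the margin.
        margins[k] = nominal - eps * sum(abs(d) for d in diff)
    return margins


def is_certified(x, label, eps):
    """True if every point in the eps-box around x is classified as label."""
    return all(m > 0 for m in worst_case_margins(x, label, eps).values())


def robust_hinge_loss(x, label, eps):
    """Hinge loss on worst-case margins: zero only if the whole box is
    classified correctly with a margin of at least 1."""
    return sum(max(0.0, 1.0 - m)
               for m in worst_case_margins(x, label, eps).values())


x = [1.0, 0.2, 0.1, 0.0]  # classified as class 0 by the toy model
print("certified for eps=0.1:", is_certified(x, 0, 0.1))
print("certified for eps=0.5:", is_certified(x, 0, 0.5))
print("robust loss for eps=0.1:", robust_hinge_loss(x, 0, 0.1))
```

Training against a loss of this kind penalises the model whenever any point in the box could be misclassified, rather than only the training point itself.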

### References

1. Szegedy, Christian, et al. “Intriguing properties of neural networks.” arXiv preprint arXiv:1312.6199 (2013).
2. Su, Jiawei, Danilo Vasconcellos Vargas, and Kouichi Sakurai. “One pixel attack for fooling deep neural networks.” IEEE Transactions on Evolutionary Computation (2019).
3. Brown, Tom B., et al. “Adversarial patch.” arXiv preprint arXiv:1712.09665 (2017).
4. Wong, Eric, et al. “Scaling provable adversarial defenses.” Proceedings of the 32nd International Conference on Neural Information Processing Systems. 2018.