
Deep learning wrap-up

© University of York

We’ve now covered enough of the basics to be able to do some amazing things with deep learning. CNNs (such as the VGG architecture) trained with stochastic gradient descent using backpropagation enable good performance on many computer vision tasks.

However, that’s not the end of the story. There are many subtle issues, tricks and clever hacks that improve performance. We’ll briefly mention some of these here so that you have a flavour of the sorts of things you have to think about as a deep learning engineer.

Programming languages and deep learning toolboxes

If you’re feeling inspired and now want to try out some practical deep learning, you will need to be competent in a programming language (most commonly Python) and a deep learning toolbox (the most popular is PyTorch, followed by TensorFlow). Python, PyTorch and TensorFlow are all free and can be installed on almost any computer. If you work through the official introductory tutorials at pytorch.org/tutorials or www.tensorflow.org/tutorials you will quickly get a flavour of how easy it is to construct and train deep neural networks.

Small filters are better

It’s a good idea to use small filters in each CNN layer. The breakthrough CNN paper in 2012 proposed the AlexNet architecture [1]. This used large filters in its early layers (11×11 in the first layer and 5×5 in the second). A 7×7 filter, for example, needs 49 × C parameters, where C is the number of input channels. The VGG architecture proposed the use of 3×3 filters throughout (9 × C parameters per filter). Because each layer then has fewer parameters, you can add more depth without increasing the total number of parameters. Depth turns out to be the key to good performance.
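To make the arithmetic concrete, here is a minimal sketch that counts the weights in a filter and compares one 7×7 layer with a stack of three 3×3 layers, which covers the same 7×7 region of the input. The function names are illustrative, biases are ignored, and the choice of 64 input and 64 output channels for every layer is an assumption made only for this example.

```python
def params_per_filter(kernel_size: int, in_channels: int) -> int:
    """Weights in one k x k filter that spans all input channels (bias ignored)."""
    return kernel_size * kernel_size * in_channels


def params_per_layer(kernel_size: int, in_channels: int, num_filters: int) -> int:
    """Weights in a convolutional layer made of num_filters filters (biases ignored)."""
    return num_filters * params_per_filter(kernel_size, in_channels)


C = 64  # assumed number of channels, chosen only for illustration

one_7x7_layer = params_per_layer(7, C, C)
three_3x3_layers = 3 * params_per_layer(3, C, C)

print("7x7 filter:", params_per_filter(7, C), "weights")  # 49 * C
print("3x3 filter:", params_per_filter(3, C), "weights")  # 9 * C
print("one 7x7 layer:", one_7x7_layer, "weights")
print("three 3x3 layers:", three_3x3_layers, "weights")
```

Under these assumptions the three-layer stack is deeper than the single 7×7 layer and still uses fewer weights.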

Vanishing and exploding gradients

When backpropagating through deep network architectures, the gradient of the loss with respect to a given parameter can become either very small or very large. This happens because the chain rule multiplies together one term per layer, so a long run of small (or large) terms shrinks (or grows) the result rapidly. When the gradient becomes very small (the “vanishing gradient problem”), it can take a very long time to train the network. When the gradient becomes very large (the “exploding gradient problem”), each update can make very large changes to the parameters, leading to unstable behaviour. Also, because a computer has only limited precision with which to represent numbers, if the gradient gets small enough it can vanish to zero and if large enough it can “overflow” so that it can no longer be represented.
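The effect can be illustrated with a deliberately simplified sketch. It assumes that every layer contributes the same scalar factor to the gradient, which is not true of a real network but is enough to show how quickly a product of many factors shrinks or grows. The function name and the factors 0.5 and 1.5 are hypothetical choices for this example.

```python
def gradient_scale(per_layer_factor: float, depth: int) -> float:
    """Multiply one factor per layer, as the chain rule does during backpropagation."""
    gradient = 1.0
    for _ in range(depth):
        gradient *= per_layer_factor
    return gradient


vanishing = gradient_scale(0.5, 50)    # factors below 1 shrink the gradient
exploding = gradient_scale(1.5, 50)    # factors above 1 grow the gradient
underflow = gradient_scale(0.5, 2000)  # too small for a float: becomes exactly 0.0
overflow = gradient_scale(1.5, 2000)   # too large for a float: becomes inf

print(vanishing, exploding)
print(underflow, overflow)
```

With only 50 layers the two gradients already differ by more than twenty orders of magnitude, and with enough layers the values can no longer be represented as floating-point numbers at all.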

A very helpful trick to deal with both problems is batch normalisation. This is effectively an extra function applied to the inputs of each layer. It rescales them, using the mean and variance computed over the current mini-batch, so that they do not become very large or very small. It is now almost always used in network training.
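As a rough sketch of the idea, the function below normalises a mini-batch of values for a single feature. It covers the training-time calculation only and leaves out the running statistics that toolboxes keep for use at test time. The names gamma, beta and eps follow common usage: gamma and beta are a scale and shift that would be learned in a real network (fixed here), and eps is a small constant that avoids division by zero. The input values are made up for illustration.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise a mini-batch of scalar activations to zero mean and unit variance,
    then apply a scale (gamma) and shift (beta)."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(variance + eps) + beta for x in batch]


activations = [1200.0, 800.0, 1000.0, 1400.0]  # made-up, very large layer inputs
normalised = batch_norm(activations)

print(normalised)
```

Whatever the scale of the original inputs, the normalised values are centred on zero with a spread of about one.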

Residual connections

Another breakthrough happened in 2015 with the residual neural network (ResNet) [2], which adds skip connections to the sorts of CNN architectures we’ve seen so far. A skip connection links the outputs of layer l-2 to the inputs of layer l, i.e. it skips layer l-1. The inputs to layer l are then the sum of the outputs of layers l-2 and l-1.

Residual neural network schematic

This helps with vanishing gradients since, during backpropagation, there are paths from the loss back to the parameters that are shorter than in a conventional CNN. It also addresses the problem that accuracy tends to saturate with increasing depth in a conventional CNN, i.e. at some point adding more depth doesn’t improve accuracy. This problem is much reduced with a ResNet, allowing networks with far greater depth to be trained.
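The effect of a skip connection on the gradient can be seen in a toy sketch. Each “layer” here is just a multiplication by a single weight, which is an assumption made to keep the example tiny, and all function names are hypothetical. The gradient of the output with respect to the input is estimated by a finite difference for a plain stack and for a stack with skip connections.

```python
def layer(x, weight):
    """Toy layer: a single scalar multiplication (an assumption for illustration)."""
    return weight * x


def plain_stack(x, weights):
    """Conventional network: each layer sees only the previous layer's output."""
    for w in weights:
        x = layer(x, w)
    return x


def residual_stack(x, weights):
    """Each layer's output is added to its own input via a skip connection."""
    for w in weights:
        x = x + layer(x, w)
    return x


def input_gradient(stack, x, weights, h=1e-6):
    """Central finite-difference estimate of d(output)/d(input)."""
    return (stack(x + h, weights) - stack(x - h, weights)) / (2 * h)


weights = [0.01] * 20  # twenty layers with small weights

plain_gradient = input_gradient(plain_stack, 1.0, weights)
residual_gradient = input_gradient(residual_stack, 1.0, weights)

print(plain_gradient, residual_gradient)
```

In the plain stack the gradient is a product of twenty small weights and all but disappears. In the residual stack each block contributes a factor of one plus its weight, so the gradient stays close to one.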

Data augmentation

A lack of diversity in the training data is a common problem and leads to poor generalisation. A common trick is to artificially increase diversity by applying transformations to the original data to simulate extra training samples. This is called data augmentation. For example, images may be randomly cropped, resized or rotated, or have their colours scaled in some way. While augmentation is effective and very commonly used, a more principled approach that is now attracting attention is to design networks that are invariant to those transformations by construction, avoiding the need for augmentation.
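A minimal sketch of augmentation is shown below. It uses only the Python standard library and represents a greyscale image as a list of rows, so it is an illustration of the idea, not what you would write with a deep learning toolbox. The function names, the crop size, the restriction to quarter-turn rotations and the intensity range of 0.8 to 1.2 are all assumptions made for this example.

```python
import random


def random_crop(image, size, rng):
    """Cut a size x size window from a random position in the image."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]


def rotate90(image):
    """Rotate the image a quarter turn clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def random_rotation(image, rng):
    """Apply 0, 1, 2 or 3 quarter turns, chosen at random."""
    for _ in range(rng.randrange(4)):
        image = rotate90(image)
    return image


def scale_intensity(image, rng, low=0.8, high=1.2):
    """Multiply every pixel by one random factor, clipped to a maximum of 255."""
    factor = rng.uniform(low, high)
    return [[min(255, round(pixel * factor)) for pixel in row] for row in image]


def augment(image, rng, crop_size=3):
    """Return one randomly transformed copy; the original image is not modified."""
    cropped = random_crop(image, crop_size, rng)
    rotated = random_rotation(cropped, rng)
    return scale_intensity(rotated, rng)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
image = [[10 * r + c for c in range(4)] for r in range(4)]  # toy 4x4 image
samples = [augment(image, rng) for _ in range(5)]

for sample in samples:
    print(sample)
```

Each call to augment produces a slightly different 3×3 image from the same original, so one training image can stand in for many.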

At this point you may be scratching your head. This course began by criticising “feature engineering” as relying on handcrafted rules. However, we seem to have replaced this with handcrafted network architectures. For example, we need to choose the number of layers, the number of filters, the stride and padding, when to switch to fully connected layers, the learning rate and so on. This is a very fair criticism. While these choices are indeed somewhat arbitrary, we do now have a good idea of which choices work well in practice and some theoretical explanations for why this is the case. Also, there is a branch of deep learning which tries to make these decisions themselves learnable. The idea is to optimise not only the weights of a given architecture but the architecture itself. This is called neural architecture search and has already provided promising results.


  1. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks.” Advances in Neural Information Processing Systems 25 (2012): 1097-1105.
  2. He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016).
This article is from the free online course Intelligent Systems: An Introduction to Deep Learning and Autonomous Systems.
