Deep learning wrap-up

© University of York

We’ve now covered enough of the basics to be able to do some amazing things with deep learning. CNNs (such as the VGG architecture) trained with stochastic gradient descent using backpropagation achieve good performance on many computer vision tasks.

However, that’s not the end of the story. There are many subtle issues, tricks and clever hacks that improve performance. We’ll briefly mention some of these here so that you have a flavour of the sorts of things you have to think about as a deep learning engineer.

Programming languages and deep learning toolboxes

If you’re feeling inspired and want to try out some practical deep learning, you will need to be competent in a programming language (most commonly Python) and a deep learning toolbox (the most popular is PyTorch, followed by TensorFlow). Python, PyTorch and TensorFlow are all free and can be installed on almost any computer. If you work through the official introductory tutorials for either toolbox, you will quickly get a flavour of how easy it is to construct and train deep neural networks.

Small filters are better

It’s a good idea to use small filters in each CNN layer. The breakthrough CNN paper in 2012 proposed the AlexNet architecture, whose first convolutional layer used 11×11 filters. This means you need 121 × C parameters per filter, where C is the number of input channels. The VGG architecture proposed the use of 3×3 filters throughout (9 × C parameters per filter). By having fewer parameters per layer, you can have more depth without increasing the total number of parameters. Depth turns out to be the key to good performance.
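The parameter arithmetic can be checked with a few lines of Python. This is only a rough sketch: the function names and the channel count of 64 are illustrative assumptions, biases are ignored, and every layer is assumed to have as many filters as input channels. It also shows a well-known consequence of stacking small filters: three stride-1 layers of 3×3 filters see the same 7×7 window as a single 7×7 layer, but with fewer weights.

```python
def conv_weights_per_filter(kernel_size, in_channels):
    """Weights in one square filter: kernel_size * kernel_size * in_channels (bias ignored)."""
    return kernel_size * kernel_size * in_channels


def stacked_layer_weights(kernel_size, channels, num_layers):
    """Total weights in num_layers conv layers, each with `channels` inputs and `channels` filters."""
    per_layer = conv_weights_per_filter(kernel_size, channels) * channels
    return per_layer * num_layers


def receptive_field(kernel_size, num_layers):
    """Width of the input window seen after num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


C = 64  # hypothetical number of channels, chosen only for illustration

# Weights per filter grow with the square of the filter size.
for k in (3, 7, 11):
    print(f"{k}x{k} filter: {conv_weights_per_filter(k, C)} weights")

# One 7x7 layer versus three stacked 3x3 layers covering the same window.
print("window of three 3x3 layers:", receptive_field(3, 3))
print("one 7x7 layer:", stacked_layer_weights(7, C, 1), "weights")
print("three 3x3 layers:", stacked_layer_weights(3, C, 3), "weights")
```

With these assumptions, the stacked 3×3 layers use roughly half the weights of the single large-filter layer while adding two extra non-linearities.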

Vanishing and exploding gradients

When backpropagating through deep network architectures, the gradient of the loss with respect to a given parameter can become either very small or very large. When the gradient becomes very small (the “vanishing gradient problem”), it can take a very long time to train the network. When the gradient becomes very large (the “exploding gradient problem”), each update can make very large changes to parameters, leading to unstable behaviour. Also, because a computer has only limited precision with which to represent numbers, a small enough gradient can underflow to zero and a large enough one can “overflow” so that it can no longer be represented.
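A toy calculation illustrates both effects. The sketch below assumes, purely for illustration, that every layer multiplies the backpropagated gradient by the same constant factor; real networks are not this uniform, and `gradient_magnitude` is a hypothetical helper rather than part of any toolbox.

```python
def gradient_magnitude(per_layer_factor, num_layers, initial=1.0):
    """Multiply a gradient by the same factor once per layer, mimicking the chain rule."""
    grad = initial
    for _ in range(num_layers):
        grad *= per_layer_factor
    return grad


# A factor below 1 shrinks the gradient with depth; a factor above 1 grows it.
for depth in (10, 50, 100):
    shrinking = gradient_magnitude(0.5, depth)
    growing = gradient_magnitude(1.5, depth)
    print(f"depth {depth}: factor 0.5 -> {shrinking:.3e}, factor 1.5 -> {growing:.3e}")

# Limited precision: Python floats are 64-bit, so an extreme depth is needed
# before the value can no longer be represented at all.
print(gradient_magnitude(0.5, 2000))  # underflows to 0.0
print(gradient_magnitude(1.5, 2000))  # overflows to inf
```

Lower-precision number formats, which are often used to speed up training, reach these limits much sooner.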

A very helpful trick to deal with both problems is batch normalisation. This is effectively an extra function that rescales the inputs of each layer, using statistics computed over the current mini-batch, to stop them getting very large or very small. It is now very widely used in network training.
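The core of batch normalisation can be sketched in plain Python. This is a minimal illustration of the training-time computation only, assuming each sample is a flat list of features; the names `batch_norm`, `gamma`, `beta` and `eps` are choices made for this sketch. In a real toolbox the scale and shift are learned, and running averages of the statistics are used at test time.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise each feature over the batch to roughly zero mean and unit variance.

    batch: list of samples, each a list of feature values.
    gamma, beta: scale and shift (fixed here, learned in a real network).
    eps: small constant that avoids division by zero.
    """
    n = len(batch)
    num_features = len(batch[0])
    normalised = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [sample[j] for sample in batch]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        scale = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            normalised[i][j] = gamma * (column[i] - mean) * scale + beta
    return normalised


# Two features on very different scales end up in a similar range.
batch = [[100.0, 0.1], [200.0, 0.2], [300.0, 0.3]]
for row in batch_norm(batch):
    print(row)
```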

Residual connections

Another breakthrough came in 2015 with the residual neural network (ResNet), which adds skip connections to the sorts of CNN architectures we’ve seen so far. A skip connection links the outputs of layer l−2 to the inputs of layer l, i.e. it skips layer l−1. The inputs to layer l are then the sum of the outputs of layers l−2 and l−1.

[Figure: Residual neural network schematic]

This helps with vanishing gradients because, during backpropagation, there are shorter paths from the loss back to the parameters than in a conventional CNN. It also addresses the problem that accuracy tends to saturate with increasing depth in a conventional CNN, i.e. at some point adding more depth doesn’t improve accuracy. This problem is much reduced with a ResNet, allowing networks with far greater depth to be trained.
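The skip connection itself is just an addition. The sketch below uses a toy fully connected layer in place of a convolution to keep the code short; that substitution, and all of the function names, are assumptions made for illustration. The outputs of layers l−2 and l−1 must have the same size for the sum to be defined.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def dense_layer(inputs, weights):
    """Toy fully connected layer: weights[i][j] connects input j to output i."""
    return relu([sum(w * x for w, x in zip(row, inputs)) for row in weights])


def input_to_layer_l(out_l_minus_2, weights_l_minus_1):
    """Sum of the outputs of layers l-2 and l-1, as fed into layer l."""
    out_l_minus_1 = dense_layer(out_l_minus_2, weights_l_minus_1)
    return [a + b for a, b in zip(out_l_minus_2, out_l_minus_1)]


x = [1.0, 2.0]  # pretend these are the outputs of layer l-2

# Even if layer l-1 contributes nothing, the signal still reaches layer l.
zero_weights = [[0.0, 0.0], [0.0, 0.0]]
print(input_to_layer_l(x, zero_weights))

# Otherwise layer l-1 only has to learn a correction added on top of x.
identity_weights = [[1.0, 0.0], [0.0, 1.0]]
print(input_to_layer_l(x, identity_weights))
```

The first call shows the short path in action: the values from layer l−2 pass straight through the addition, and gradients can flow back along the same route.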

Data augmentation

Limited diversity in training data is a common problem and leads to poor generalisation. A common trick is to artificially increase diversity by applying transformations to the original data to simulate extra training samples. This is called data augmentation. For example, images may be randomly cropped, resized or rotated, or have their colours scaled in some way. While augmentation is effective and very commonly used, a more principled approach that is now attracting attention is to design networks that are invariant to those transformations by construction, avoiding the need for augmentation.
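Two of these transformations can be sketched with the Python standard library alone. The example assumes a greyscale image stored as a list of rows with pixel values between 0 and 255; the function names and the scaling range of 0.8 to 1.2 are arbitrary choices for illustration. Deep learning toolboxes ship their own, much faster, augmentation utilities.

```python
import random


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window from a random position in the image."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - crop_h)    # randint includes both end points
    left = rng.randint(0, w - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def random_intensity_scale(image, rng, low=0.8, high=1.2):
    """Multiply every pixel by one random factor, capping values at 255."""
    factor = rng.uniform(low, high)
    return [[min(255.0, pixel * factor) for pixel in row] for row in image]


def augment(image, crop_h, crop_w, rng):
    """Produce one randomly transformed copy of the image."""
    return random_intensity_scale(random_crop(image, crop_h, crop_w, rng), rng)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
image = [[float(10 * r + c) for c in range(4)] for r in range(4)]

# Each call yields a slightly different training sample from the same image.
for _ in range(3):
    print(augment(image, 3, 3, rng))
```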

At this point you may be scratching your head. This course began by criticising “feature engineering” as relying on handcrafted rules. However, we seem to have replaced this with handcrafted network architectures. For example, we need to decide how many layers and filters to use, what stride and padding to apply, when to switch to fully connected layers, what learning rate to use and so on. This is a very fair criticism. While these choices are indeed somewhat arbitrary, we do now have a good idea of what choices work well in practice and some theoretical explanations for why this is the case. Also, there is a branch of deep learning which tries to make these decisions themselves learnable. The idea is to optimise not only the weights of a given architecture but the architecture itself. This is called neural architecture search and has already provided promising results.


  1. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks.” Advances in neural information processing systems 25 (2012): 1097-1105.
  2. He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
This article is from the free online course Intelligent Systems: An Introduction to Deep Learning and Autonomous Systems.
