Advanced network features

An article describing the key features of common deep learning encoder architectures, including LeNet, VGG, Google Inception and ResNet.

A closer look at some of the important innovations used in encoder network architectures.

In the video we talked about a few different network architectures, including LeNet, VGG, ResNet and Google Inception, but how exactly do they work? How does a ResNet differ from other CNNs? In this article we will briefly explain how some of these network architectures are constructed.

LeNet, AlexNet & VGG

The original deep learning architecture, published in 1998 by Yann LeCun, is called LeNet, and it is very similar in many ways to the simple network we have put together in the exercises.

A flow diagram of the LeNet architecture. Input leads to a convolutional layer, then a smaller max pooling layer, then a convolutional layer, then a yet smaller max pooling layer, then a convolutional layer, then a smaller still max pooling layer, then finally a fully connected layer before the output.

There is a series of three convolutional layers, with increasing numbers of output channels, and with pooling layers in between. The network is completed with two fully connected layers, the second of which is the output layer corresponding to the ten classes. Most networks, our exercise included, use max pooling layers and ReLU activation functions, whereas the original LeNet used average pooling layers and tanh activation functions, but otherwise many networks share this same overall structure: convolutional layers that increase the 'depth' (the number of channels) of the data, pooling layers that reduce the amount of information in the spatial dimensions, and one or more fully connected layers before the final output or decision layer.
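
To make this concrete, here is a minimal sketch of a LeNet-style network, written in PyTorch purely for illustration (the exercises may use a different framework, input size or layer widths), and using max pooling and ReLU rather than the original average pooling and tanh.

import torch
import torch.nn as nn

# A LeNet-style classifier: three convolutions with pooling in between,
# followed by two fully connected layers (the second is the output layer).
# Assumes 1-channel 32x32 inputs and ten classes; adjust for your own data.
class LeNetStyle(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(),    # 32x32 -> 28x28, 6 channels
            nn.MaxPool2d(2),                              # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),   # 14x14 -> 10x10, 16 channels
            nn.MaxPool2d(2),                              # 10x10 -> 5x5
            nn.Conv2d(16, 120, kernel_size=5), nn.ReLU(), # 5x5 -> 1x1, 120 channels
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),                   # output layer: one unit per class
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Quick shape check on a dummy batch of four images
x = torch.randn(4, 1, 32, 32)
print(LeNetStyle()(x).shape)  # torch.Size([4, 10])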

Subsequent networks such as AlexNet (REF) and VGG (REF) keep this basic structure while increasing the number and size of the convolutional layers, up to as many as nineteen layers in the case of VGG. But the principle is the same: convolutional layers, with pooling layers in between some of them, followed by fully connected layers to complete the network.

Google Inception

An important network innovation we haven't yet seen in our examples is the so-called Inception module. The name was inspired by the "dreams within dreams" structure of the 2010 film of the same name, because each individual Inception module can be thought of as a (small) network within a network. An Inception module performs several sets of convolutions in parallel, with different kernel sizes, for example 1×1, 3×3 and 5×5, plus a max pooling branch, with the padding and stride set so that the output of each branch keeps the same spatial size as the input. This means that all the resulting convolution channels can be stacked in the usual way and fed into the following layer.

A flow diagram labelled Inception module. There are three blocks inside a larger block, labelled 1 x 1 kernel convolution, 3 x 3 kernel convolution, and 5 x 5 kernel convolution respectively; each is connected to an input arrow and an output arrow outside the larger block. The output is labelled depth concatenation.

For example, the Inception module in the figure above has X 1×1 convolutions, X 3×3 convolutions, X 5×5 convolutions, and finally a 3×3 max pooling branch.

By including parallel branches of varying kernel sizes, Inception modules allow the network to consider the image at different scales and to learn which scales are most useful at each stage. So in some parts of the network a 3×3 kernel might be more effective and be weighted accordingly, while in other parts a kernel of a different size might be more important.
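
As a sketch of how this works in practice, here is a minimal Inception-style module, again written in PyTorch for illustration; the branch widths are placeholder values rather than the numbers from the figure above.

import torch
import torch.nn as nn

# A minimal Inception-style module: parallel 1x1, 3x3 and 5x5 convolutions plus
# a 3x3 max pooling branch, with padding chosen so every branch keeps the
# input's spatial size. The branch outputs are concatenated along the channel
# axis ('depth concatenation'). Branch widths here are illustrative only.
class InceptionModule(nn.Module):
    def __init__(self, in_channels, c1=16, c3=32, c5=16):
        super().__init__()
        self.branch1 = nn.Conv2d(in_channels, c1, kernel_size=1)             # 1x1
        self.branch3 = nn.Conv2d(in_channels, c3, kernel_size=3, padding=1)  # 3x3, same size
        self.branch5 = nn.Conv2d(in_channels, c5, kernel_size=5, padding=2)  # 5x5, same size
        self.branch_pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)  # pooling keeps size

    def forward(self, x):
        outputs = [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)]
        return torch.cat(outputs, dim=1)  # stack all branches along the channel dimension

# Example: a 64-channel 28x28 input becomes (16 + 32 + 16 + 64) = 128 channels, still 28x28
x = torch.randn(1, 64, 28, 28)
print(InceptionModule(64)(x).shape)  # torch.Size([1, 128, 28, 28])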

ResNet

The other network innovation we will discuss in this article is the ResNet, short for Residual Network. This is similar to the Inception module in that there are parallel pathways at certain points in the network, but here, rather than using different kernel sizes in parallel, the ResNet passes the input through unchanged in one branch, the so-called "skip connection", while the other branch undergoes one or more convolutions. The outputs of the two branches are then added together, rather than stacked as in the Inception modules.

A flow diagram of a residual unit. An arrow leads from input to convolutional layers to output, while another arrow labelled skip connection leads directly to the output. There is a plus symbol where the two arrows meet at the output.

The rationale behind these skip connections is that if at certain points in the network the convolutions aren't adding much to the training process, the unchanged signal is still passed through, perhaps to layers where the convolutions are proving more effective. Layers that are initially skipped in this way may still prove valuable in refining the network later in training. Skip connections are therefore an effective way of speeding up the training process. In fact, ResNets have proved so effective that, at the time of writing, they are probably the most commonly used encoder architecture for classification problems in deep learning.
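
Again purely as an illustration, here is a minimal residual unit in PyTorch, assuming the convolutions keep the number of channels and the spatial size unchanged so that the identity branch can be added directly; batch normalisation is included, as in the original ResNet design, but the key point is the addition at the end.

import torch
import torch.nn as nn
import torch.nn.functional as F

# A minimal residual unit: two 3x3 convolutions on one branch, an identity
# 'skip connection' on the other, and an element-wise addition of the two.
# Assumes channel count and spatial size are unchanged by the convolutions.
class ResidualUnit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        identity = x                          # skip connection: input passed through unchanged
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + identity)         # add (not concatenate) the two branches

# Example: the output has the same shape as the input
x = torch.randn(1, 64, 28, 28)
print(ResidualUnit(64)(x).shape)  # torch.Size([1, 64, 28, 28])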

Images (c) The University of Nottingham

