
Deep Learning History


As machine learning engineers, part of our job is to decide on the form of the function that maps inputs to outputs. We’re going to take a brief historical look at what has led to the conclusion that something called deep learning is the best option and what that means in terms of the form of the functions we should use.

Feature Engineering

Usually, our raw input data (such as an audio stream or image) is very high dimensional. This means that we have many thousands or even millions of input values. Devising a function by hand that can deal with such complexity is very hard, and this led researchers for several decades to focus on something called feature engineering. The idea was to handcraft (i.e. design by hand, or engineer) some features: quantities that can be extracted from your raw data and that somehow summarise or simplify your high-dimensional input.

Edge Detection

For example, let’s say that you want to recognise an object in an image. You might decide that the outline of the object is a good feature to use. So, now you need a way to extract the boundary of objects from images. This task is called edge detection. Perhaps you define an edge as a location where the colour changes rapidly (i.e. as you cross the boundary from object to background, you expect a sharp change in colour). But now you start to run into problems. What threshold should you use to define when a change in colour is caused by an edge? Will that threshold always work? What about when an object is in front of a background that is the same colour as the object, so there is no obvious edge? What about objects with internal texture that will cause lots of edges to be detected within the object?
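To make the difficulty concrete, here is a minimal sketch of the kind of hand-engineered edge detector being described, written in Python with NumPy (the simple gradient approximation and the 0.1 threshold are our own illustrative choices, not part of the article):

```python
import numpy as np

def edge_map(gray, threshold=0.1):
    """Mark pixels where intensity changes rapidly (a crude edge detector).

    gray: 2D array of intensities in [0, 1].
    threshold: a hand-picked constant, exactly the kind of magic number
    that feature engineering forces you to choose.
    """
    gray = np.asarray(gray, dtype=float)
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, :-1] = gray[:, 1:] - gray[:, :-1]   # horizontal intensity change
    gy[:-1, :] = gray[1:, :] - gray[:-1, :]   # vertical intensity change
    magnitude = np.sqrt(gx ** 2 + gy ** 2)    # how quickly the image changes here
    return magnitude > threshold              # "edge" wherever the change is large
```

Every question above (what threshold? what about low-contrast boundaries or internal texture?) corresponds to a decision baked into this little function by hand.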

You can probably think of lots more problems. And, who’s to say object boundaries are a good feature to use anyway?

Despite its limitations, feature engineering was dominant in computer vision until the mid-2010s. Some methods were actually quite successful. For example, the Scale Invariant Feature Transform (SIFT, proposed in 1999) [1] was a way of finding interesting points in an image and then describing them in a way that was very distinctive (so they could be found again in another image). The approach is still quite competitive with state-of-the-art techniques for some problems. However, even then, researchers noticed in 2012 that if you took the feature descriptors produced by SIFT and applied a square root to the values, performance on many tasks improved by about 5% [2]! This situation is clearly somewhat ridiculous. Why a square root? And why SIFT in the first place? How can you be sure that there isn’t some small modification you could make to your features that would boost performance on the task you’re trying to solve?
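For reference, the square-root modification mentioned here (introduced by Arandjelović and Zisserman [2] and often called RootSIFT) amounts to L1-normalising each SIFT descriptor and then taking an element-wise square root. A minimal sketch, where the function name and the small epsilon are our own choices:

```python
import numpy as np

def root_descriptor(desc, eps=1e-7):
    """Apply the square-root trick to a feature descriptor (cf. RootSIFT)."""
    desc = np.asarray(desc, dtype=np.float64)
    desc = desc / (np.abs(desc).sum() + eps)  # L1-normalise the descriptor
    # SIFT descriptors are non-negative, so the square root is well defined
    return np.sqrt(desc)                      # element-wise square root
```

Comparing the resulting vectors with ordinary Euclidean distance then behaves like comparing the original descriptors with the Hellinger kernel, which is the explanation the authors give for the improvement on matching tasks.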

End-to-End Learning

This argument motivates the idea of end-to-end learning. The idea is that you will learn, in one go, the entire mapping from raw input data to final output without hand-engineering any of the features used to solve the problem. Hence, the machine learning algorithm will have to learn low-level features (things that can be immediately calculated from the raw input), mid-level features (more abstract concepts that arise out of a combination of low-level features, probably with some invariance to unimportant sources of variation) and high-level features (abstract descriptions of the contents of the input data). Taking the example of images, a low-level feature might be an edge, a mid-level feature an eye and a high-level feature the identity of a face in an image. But remember – all of this will be learnt – you don’t design any of these features in advance.

Deep Learning

It’s clear that the function we’re trying to learn is going to be complicated. How on earth can an image be mapped to the identity of the face in the image? It turns out that the best way to construct functions of sufficient complexity is to build them out of a composition of lots of simple functions applied one after another. So, our overall function \(f_w(x)\) is defined as:

\[f_w(x) = f^n_{w_n}(\dots f^2_{w_2}(f^1_{w_1}(x)))\]

This means we first apply \(f^1\) to \(x\), and this function has its own parameters \(w_1\). Then, we apply \(f^2\) to the result of \(f^1\), then \(f^3\) to that result, and so on. The “deep” in deep learning refers to the application of many functions one after another to the input. This depth turns out to provide immense power to the overall function in terms of what it can represent.
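As a toy illustration of this composition (not the article’s own example: the ReLU non-linearity and the layer sizes below are our assumptions), each simple function can be an affine map followed by a non-linearity, and the deep function is just their composition:

```python
import numpy as np

def layer(w, b):
    """Return a simple function f^i with its own parameters (w, b)."""
    return lambda x: np.maximum(0.0, w @ x + b)  # affine map + ReLU non-linearity

def compose(*fs):
    """compose(f1, f2, f3)(x) computes f3(f2(f1(x)))."""
    def f(x):
        for g in fs:
            x = g(x)
        return x
    return f

rng = np.random.default_rng(0)
f1 = layer(rng.normal(size=(8, 4)), np.zeros(8))   # f^1 with parameters w_1
f2 = layer(rng.normal(size=(8, 8)), np.zeros(8))   # f^2 with parameters w_2
f3 = layer(rng.normal(size=(2, 8)), np.zeros(2))   # f^3 with parameters w_3
f_w = compose(f1, f2, f3)                          # the overall "deep" function

print(f_w(rng.normal(size=4)))  # maps a 4-dimensional input to a 2-dimensional output
```

Each extra function in the composition adds depth, and it is this stacking of simple pieces that makes the overall function so expressive.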

References

  1. Lowe, David G. “Distinctive Image Features from Scale-Invariant Keypoints.” International Journal of Computer Vision 60.2 (2004): 91–110.
  2. Arandjelović, Relja, and Andrew Zisserman. “Three Things Everyone Should Know to Improve Object Retrieval.” 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012.
© University of York
This article is from the free online course Intelligent Systems: An Introduction to Deep Learning and Autonomous Systems.
