The deep learning renaissance
In the last couple of videos I’ve been a bit negative about perceptrons and multilayer perceptrons – and the preceding quiz hasn’t exactly made a good case for them!
In recent years, so-called “deep learning” approaches to machine learning have been successfully applied to computer vision, natural language processing, and speech recognition. Deep learning uses networks with many layers. Ideally, the lower layers learn low-level features (like lines and edges in an image), intermediate layers learn how to put them together into parts (like arms and legs, noses and tails, wheels and faces), and upper layers learn how to combine these into objects (is it a car? a cat? a cow?). The key is to use “high-capacity” models – that is, ones with many parameters – arranged in several layers. Multilayer neural networks are a prime example.
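To get a feel for what “many parameters” means, here is a minimal sketch that counts the weights and biases in a fully connected multilayer network. The layer sizes are made up for illustration (784 inputs would correspond to a 28x28 image) and are not taken from any system mentioned in this article.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units in each layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection between the two layers
        total += n_out         # one bias per unit in the receiving layer
    return total


# Hypothetical layer sizes, chosen only for illustration.
shallow = count_parameters([784, 100, 10])
deep = count_parameters([784, 512, 512, 512, 10])

print(shallow)  # 79510
print(deep)     # 932362
```

Even this modest “deep” example has nearly a million parameters to learn, which is one reason large quantities of training data matter so much.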
A critical ingredient is the use of far larger quantities of data than was previously possible. In former times, data mining researchers paid scant attention to images and speech signals. Of course, the world abounds with signal data – but it is generally unlabeled. Recently, large collections of labeled data have been created, stimulating the application of deep learning techniques to tasks such as image recognition, face verification, and speech and language modeling.
The ImageNet Large Scale Visual Recognition Challenge is to classify images obtained from sources such as Flickr into 1000 possible object categories. The training set contains 1.2M images, which were hand-labeled(!) based on the presence or absence of an object belonging to these categories. A random subset of 50,000 images (50 per class) was used as the validation set and 100,000 images (100 per class) as the test set. The test images are hidden from participants in the challenge and available only to the judges.
The percentage of test images for which the target label is not among the 5 highest-probability predictions is called the “top-5 error”. Classical machine learning methods struggle to get below 25% – far worse than human performance, which has been measured at 5.1% for the same test set. The 2015 challenge was won by a team from Microsoft Research Asia that achieved 3.6%, substantially outperforming humans. They used a multilayer architecture with 152 layers! (Don’t try setting this up in Weka.)
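The top-5 error is easy to compute once you have a score for every class. Here is a minimal sketch; the function name `top_k_error` and the toy scores are made up for illustration and are not the challenge's official evaluation code.

```python
def top_k_error(scores, targets, k=5):
    """Fraction of examples whose target is not among the k highest-scoring classes."""
    misses = 0
    for class_scores, target in zip(scores, targets):
        # Class indices ordered from highest score to lowest.
        ranked = sorted(range(len(class_scores)),
                        key=lambda c: class_scores[c], reverse=True)
        if target not in ranked[:k]:
            misses += 1
    return misses / len(targets)


# Toy data: 3 images, 6 classes, invented scores.
scores = [
    [0.05, 0.12, 0.40, 0.20, 0.15, 0.08],
    [0.30, 0.25, 0.20, 0.15, 0.08, 0.02],
    [0.01, 0.50, 0.20, 0.15, 0.10, 0.04],
]
targets = [2, 5, 2]

top5 = top_k_error(scores, targets, k=5)
top1 = top_k_error(scores, targets, k=1)
print(top5, top1)
```

In this toy data only the second image misses the top 5 (its target has the lowest score), so the top-5 error is one in three, whereas the ordinary top-1 error is two in three. The top-5 measure is more forgiving, which makes sense when an image may contain several objects.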
Face recognition, an important special case of object recognition, has been the subject of intense research for decades – and deep learning networks have transformed the field. If you have used photo apps that identify faces, you have probably been surprised (dismayed?) at how good they are. Controlled experiments on face verification, where the question is whether two mugshots belong to the same person or not, have yielded better-than-human performance.
More data gives even better results. Every time one of its 2 billion users uploads a photo to Facebook and tags someone, that person is helping the facial recognition algorithm – which, according to the company, is able to accurately identify a person 98% of the time. (According to the FBI, its own facial recognition technology identifies the correct person in a list of the top 50 people only 85% of the time.)
Researchers at Google have created language models based on single-hidden-layer networks trained with vast amounts of data – 30 billion words. One model trains a neural network to predict upcoming words given their context. Another predicts nearby words within a certain distance before and after the source word. Here every word in the vocabulary is a “class”, so the number of classes equals the vocabulary size, which ranges from 10^5 to 10^9 terms. To keep this manageable the output is decomposed into a binary tree – for a V-word vocabulary it is then necessary to evaluate only about log2(V) rather than V output nodes.
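The saving from the binary tree is worth spelling out. The sketch below is a toy illustration of the idea (often called hierarchical softmax), not Google's implementation: each word is a leaf of a balanced binary tree, and its probability is the product of the left/right decisions on the path from the root. The branch probabilities here are random numbers; in a real model each would come from the network's output at that tree node. The names `tree_depth` and `word_probability` are my own.

```python
import random


def tree_depth(vocab_size):
    """Binary decisions needed to reach one of vocab_size leaves in a balanced tree."""
    return max(1, (vocab_size - 1).bit_length())  # equals ceil(log2(vocab_size))


def word_probability(word_index, depth, p_right):
    """Multiply the decision probabilities on the path from the root to a word's leaf.

    p_right maps a path prefix (a string of '0'/'1') to the probability of
    branching right at that internal node.
    """
    path = format(word_index, "0{}b".format(depth))
    prob = 1.0
    for i, bit in enumerate(path):
        p = p_right[path[:i]]
        prob *= p if bit == "1" else 1.0 - p
    return prob


# How many nodes must be evaluated per word, with and without the tree?
for vocab_size in (10 ** 5, 10 ** 9):
    print(vocab_size, "flat outputs versus", tree_depth(vocab_size), "tree decisions")

# Toy vocabulary of 8 words: depth 3, so 7 internal nodes with invented probabilities.
random.seed(0)
depth = tree_depth(8)
prefixes = [""] + [format(i, "0{}b".format(d))
                   for d in range(1, depth) for i in range(2 ** d)]
p_right = {prefix: random.random() for prefix in prefixes}
total = sum(word_probability(w, depth, p_right) for w in range(8))
print(total)  # the word probabilities still form a proper distribution
```

For a vocabulary of 10^5 words that is 17 decisions instead of 100,000 outputs, and for 10^9 words just 30. The word probabilities still sum to 1 (up to rounding), so nothing is lost by the decomposition.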
Many research groups are mining massive quantities of text data in order to learn as much as possible from scratch, replacing features that have previously been hand engineered by ones that are learned automatically. Large neural networks are being applied to tasks ranging from sentiment classification and translation to dialog and question answering. Google uses deep learning techniques to learn how to translate languages from scratch, based on voluminous data.
It is worth noting that these technologies can raise controversial ethical issues. Federal governments deploy face verification technology in the fight against international terrorism; airports use it to reduce lineups at immigration. The application of face recognition in widespread video surveillance has a profound effect on the balance between security and privacy, and other civil liberties. At the individual level, stalkers exploit end-user web services for face recognition.
Don’t try this at home! Weka is not capable of learning at such massive scales. Typically, specialist (difficult!) mathematics is used to simplify the problem computationally, with many heuristic shortcuts. For example, Microsoft’s 152-layer network (mentioned above) uses connections that skip over layers, and performs judicious pooling operations to reduce the flow of data. Most deep learning is performed on specialist hardware, typically graphics processing units (GPUs), which can yield execution speeds that are orders of magnitude faster than standard implementations.
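To show what “connections that skip over layers” and “pooling” mean, here is a deliberately tiny sketch in plain Python. It is not Microsoft's architecture, which works on images with convolutional layers; all the function names and numbers below are made up for illustration.

```python
def relu(values):
    """Replace negative values with zero."""
    return [max(0.0, v) for v in values]


def linear(values, weights, biases):
    """Fully connected layer: weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, biases)]


def residual_block(values, weights, biases):
    """Add the block's input back onto its transformed output (a skip connection)."""
    transformed = linear(values, weights, biases)
    return relu([t + v for t, v in zip(transformed, values)])


def max_pool_1d(values, window=2):
    """Keep the largest value in each non-overlapping window, shrinking the data."""
    return [max(values[i:i + window]) for i in range(0, len(values), window)]


x = [1.0, -2.0, 3.0, 0.5]
zero_weights = [[0.0] * 4 for _ in range(4)]
zero_biases = [0.0] * 4

# With all-zero weights the layer contributes nothing, yet the input still gets through.
out = residual_block(x, zero_weights, zero_biases)
pooled = max_pool_1d(out)
print(out)     # [1.0, 0.0, 3.0, 0.5]
print(pooled)  # [1.0, 3.0]
```

The skip connection means that a layer only has to learn a correction to its input, which is part of what makes very deep stacks trainable. Pooling halves the amount of data passed on to the next layer.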
Want to learn more?
I hate to self-advertise, but a recent book that I wrote with friends includes wonderful material on deep learning. Although pretty mathematical, it’s a fairly accessible explanation of this area by the standards of the field.
Data mining: Practical machine learning tools and techniques (4th edition), by Ian Witten, Eibe Frank, Mark Hall, and Christopher Pal.
Note: this book is not required reading for this course.