What is Computer Vision?

© University of York

What is Computer Vision and Why is it Useful?

Extracting useful information from images, or seeking to understand images using computers, is called computer vision. This is a very broad topic encompassing many problems, applications and techniques. We’re going to focus on the use of deep learning to solve problems in computer vision. Before we do, it’s reasonable to ask why this is interesting or useful. Why is computer vision useful? What sort of problems are we aiming to solve? What real-world applications could be enabled by new computer vision techniques? This article answers these questions by providing a small sample of exciting recent developments in computer vision.

Most of us now have several thousand photos stored on our mobile phones, and the internet contains billions of images and billions of hours of video. Enabling a user to find the content they are looking for, or extracting meaning from all of this content, requires computers to analyse these images and identify what they contain. A very simple strategy is to run object or face recognition systems on each image and tag it with the recognised labels. A user can then search the tags via text-based search.
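The tag-and-search strategy can be sketched in a few lines. This is a minimal illustration: in a real system the tags would come from running an object or face recognition model over each image, whereas here they are hand-written stand-ins.

```python
def search(index, query_tag):
    """Return the IDs of all images whose tag set includes the query tag."""
    return [image_id for image_id, tags in index.items() if query_tag in tags]

# Hand-written tags standing in for the output of a recognition system.
index = {
    "img1.jpg": {"dog", "park"},
    "img2.jpg": {"cat"},
    "img3.jpg": {"dog", "beach"},
}

print(search(index, "dog"))  # ['img1.jpg', 'img3.jpg']
```

The key design point is that all the expensive vision work happens once, at indexing time; the search itself is then an ordinary text lookup.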

More interesting is to allow the user to search by image and retrieve images containing similar content. This problem can be solved by using deep neural networks to extract visual concepts from an image and then finding other images containing the same concepts:

Content-based retrieval
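The retrieval idea above can be sketched as nearest-neighbour search over embedding vectors. In a real system the embeddings would be produced by a deep network; the 4-dimensional vectors below are made up for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity of two embedding vectors, ignoring their magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec, database, top_k=2):
    """Rank database images by embedding similarity to the query image."""
    scores = {img_id: cosine_similarity(query_vec, vec)
              for img_id, vec in database.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Made-up embeddings standing in for deep network features.
database = {
    "beach.jpg":  np.array([0.9, 0.1, 0.0, 0.2]),
    "forest.jpg": np.array([0.1, 0.8, 0.3, 0.0]),
    "coast.jpg":  np.array([0.8, 0.2, 0.1, 0.3]),
}
query = np.array([0.85, 0.15, 0.05, 0.25])  # embedding of the query image
print(retrieve(query, database))  # the two seaside images rank highest
```

Because images with similar content map to nearby points in the embedding space, the highest-scoring results share visual concepts with the query even though no text tags are involved.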

Even more challenging is to get a computer to summarise an image, not just listing objects but understanding actions and relations between objects:

Image labelling examples

3D Environment

Humans use their visual system to understand the 3D world as they move around. We avoid obstacles and dangers, plan routes to desired points, find and use objects, and much more. This requires us to reconstruct a 3D environment from the two images received by our eyes and to reason about which scene elements are closer than others. When many images of the same scene are available, we can use geometric computer vision methods such as structure-from-motion and multiview stereo. These methods even work when the images were taken by different cameras and at different times of day with varying lighting. This, for example, allows highly detailed 3D models of landmarks to be reconstructed from tourist photographs:

3D construction from 2D photographs

While this was possible with classical computer vision methods that did not require deep learning, more recent advances are based on neural networks. For example, it is now possible to synthesise new viewpoints with photorealistic appearance and change the appearance and lighting:

NeRF in the Wild

Excitingly, we can even recover 3D information from a single image, effectively training a neural network to translate a colour image into a depth map (an image where the value at each pixel gives the distance to the scene in that direction). The following video processes each frame independently, yet you can see that it recovers high-quality depth maps that are smooth over time:

Monodepth
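To see what a depth map encodes, note that each pixel's depth, combined with a camera model, pins down a 3D point. The sketch below back-projects a depth map through an assumed pinhole camera; the focal lengths and principal point are made-up values for illustration, not from any particular system.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project an H-by-W depth map into 3D points in the camera frame,
    assuming a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth                     # distance along the viewing direction
    x = (u - cx) * z / fx         # standard pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)  # shape (H, W, 3)

depth = np.full((4, 4), 2.0)  # toy depth map: a flat wall 2 m away
points = depth_to_points(depth, fx=50.0, fy=50.0, cx=2.0, cy=2.0)
print(points.shape)  # (4, 4, 3)
```

This is why a single predicted depth map is so useful: it converts an ordinary photograph into an explicit 3D representation of the scene.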

Besides 3D information, deep learning can also be used to label the meaning of every pixel in an image. This problem is known as semantic segmentation and provides a rich description of an image from which more complex decisions or understanding can be inferred:

Semantic segmentation
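The per-pixel labelling above can be sketched as follows: a segmentation network outputs one score per class at every pixel, and the label map is the per-pixel argmax over those scores. The class names and random scores below are illustrative stand-ins for real network output.

```python
import numpy as np

CLASSES = ["road", "car", "sky"]  # illustrative label set

def logits_to_labels(logits):
    """Convert an (H, W, num_classes) score volume to an (H, W) label map
    by taking the highest-scoring class at each pixel."""
    return np.argmax(logits, axis=-1)

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4, len(CLASSES)))  # pretend network output
labels = logits_to_labels(logits)
print(labels.shape)  # (3, 4): one class index per pixel

names = np.array(CLASSES)[labels]  # per-pixel class names, same shape
```

The resulting label map assigns a meaning to every pixel, which is exactly the rich description from which higher-level decisions can be made.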

Computer Vision for Graphics

Computer vision is also being used to generate new content for the purposes of computer graphics or image editing. For example, the creation of deep fakes:

Deep video portraits

Building on the semantic segmentation mentioned above, there are now methods that enable synthesis of an image from the semantic labels. This allows a user to draw the scene layout they desire, then the system turns this into a photorealistic image:

GauGAN

Computer Vision in Science

Computer vision is also being deployed to solve fundamental problems in many areas of science. Medicine, neuroscience, biology, chemistry, astronomy and many other fields record images of one kind or another, and being able to label abnormalities, detect features, remove noise and so on is of crucial importance. As a recent example, you may have seen in the news in 2019 that the first image of a black hole was released. This involved a very complex imaging technique using radio telescopes distributed across the Earth, followed by extracting a very weak signal from very noisy data.

Image of black hole

This article is from the free online course Intelligent Systems: An Introduction to Deep Learning and Autonomous Systems.

Created by FutureLearn.
