
Image formation

In this video, QUT robotics professor Peter Corke demonstrates how a central projection model can be used for image formation.
Now we’re going to look at a more mathematical way to describe the projection process: the projection of a point from the real world onto the image plane. We’re going to use a different projection model from the one we used last time, a model referred to as the “central projection” model. The key elements of this model are the camera’s coordinate frame, which we denote by C. The image plane is parallel to the camera’s x and y axes and positioned at a distance f in the positive z direction, where f is equivalent to the focal length of the lens.
Now, in order to project the point, we cast a ray from the point in the world through the image plane to the origin of the camera. With the central projection model, you’ll note that the image is non-inverted. We can write an equation for the point P in homogeneous coordinates: we multiply the world coordinates, X, Y, Z, written in homogeneous form, by a three-by-four matrix to get the homogeneous coordinates of the projected point on the image plane. Let’s look at this equation in a little more detail. It’s quite straightforward to write an expression for x tilde, y tilde, z tilde in terms of the focal length and the world coordinates, X, Y, Z.
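The projection step described here can be sketched numerically. The focal length and the world point below are assumed example values, not figures from the video:

```python
import numpy as np

f = 0.008  # focal length in metres (8 mm, an assumed example value)

# The 3x4 central-projection matrix described in the text
P = np.array([
    [f,   0.0, 0.0, 0.0],
    [0.0, f,   0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

# World point (X, Y, Z) in the camera frame, written in homogeneous form
Pw = np.array([0.3, 0.4, 5.0, 1.0])  # metres (assumed example point)

p_tilde = P @ Pw  # homogeneous image-plane point (x~, y~, z~)
```

Note that the result is x~ = fX, y~ = fY, z~ = Z, exactly the expression the lecture refers to.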
We can transform the homogeneous coordinates to Cartesian coordinates using the rule that we talked about in the last section and with a little bit of rearrangement, we can bring the equation into this form and this is exactly the same form as we derived in the last lecture by looking at similar triangles. What’s really convenient and useful about this homogeneous representation of the image formation process is that it is completely linear. We don’t have this explicit division by Z, the distance between the camera and the object. It’s implicit in the way we write the equations in homogeneous form. Let’s look at this equation again and we can factor this matrix into two.
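The homogeneous-to-Cartesian conversion is just a division by the third element, and the result agrees with the similar-triangles form x = fX/Z, y = fY/Z. A minimal sketch, with assumed example values:

```python
import numpy as np

f = 0.008                 # focal length in metres (assumed)
X, Y, Z = 0.3, 0.4, 5.0   # world point in the camera frame (assumed)

# Homogeneous image-plane point from the central projection: (fX, fY, Z)
p_tilde = np.array([f * X, f * Y, Z])

# Divide the first two elements by the third to get Cartesian coordinates
x = p_tilde[0] / p_tilde[2]
y = p_tilde[1] / p_tilde[2]

# These match the similar-triangles result x = fX/Z, y = fY/Z;
# the division by Z is implicit in the homogeneous form.
```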
The matrix on the right has elements that are either 0, 1 or f, the focal length of the lens, so this matrix performs the scaling and zooming; it’s a function of the focal length of our lens. The matrix on the left has an interesting shape: it’s only three by four, and this matrix performs the dimensionality reduction, crunching points from three dimensions down into two. So far, we have considered the image plane to be continuous. In reality, the image plane is quantized. It consists of a massive array of light-sensing elements which correspond to the pixels in the output image. The dimension of each pixel in this grid I’m going to denote by the Greek letter rho.
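This factorisation can be checked directly: multiplying the 3 × 4 dimensionality-reduction matrix by the diagonal scaling matrix recovers the full projection matrix. The focal length is an assumed example value:

```python
import numpy as np

f = 0.008  # focal length in metres (assumed example value)

# Right-hand factor: diagonal, elements are 0, 1 or f -- scaling and zooming
scale = np.diag([f, f, 1.0, 1.0])

# Left-hand factor: 3x4, performs the dimensionality reduction from 3D to 2D
reduce_dim = np.hstack([np.eye(3), np.zeros((3, 1))])

# Their product recovers the full 3x4 central-projection matrix
P = reduce_dim @ scale
```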
So the pixels are ρu wide and ρv high. Pixels are really, really small, so the width and height of a pixel are often on the order of around 10 microns, maybe a bit bigger, maybe a bit smaller. What we need to do now is convert the coordinate P, which we computed previously in units of meters with respect to the origin of the image plane, to units of pixels. Pixel coordinates are measured from the top-left corner of the image, so we need to do a scaling and a shifting, and that’s a simple linear operation.
So if we have the Cartesian x and y coordinates of the point P on the image plane, we can convert them to the equivalent pixel coordinates, which we denote by u and v, and we can represent that again in homogeneous form. Here we multiply by a matrix whose elements are functions of the pixel dimensions, ρu and ρv, and of the coordinates of what’s called the principal point. The principal point is the pixel coordinate where the z axis of the camera’s coordinate frame pierces the image plane. The homogeneous pixel coordinates can be converted to the more familiar Cartesian pixel coordinates u and v by the transformation rule that we covered earlier.
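This scaling-and-shifting step can be sketched as a single matrix multiplication. The pixel size, principal point and image-plane point below are all assumed example values:

```python
import numpy as np

rho_u = rho_v = 10e-6   # pixel dimensions in metres (10 microns, assumed)
u0, v0 = 640.0, 512.0   # principal point in pixels (assumed)

# Matrix that scales by the pixel size and shifts to the principal point
M = np.array([
    [1.0 / rho_u, 0.0,         u0],
    [0.0,         1.0 / rho_v, v0],
    [0.0,         0.0,         1.0],
])

x, y = 4.8e-4, 6.4e-4   # point on the image plane in metres (assumed)
p_tilde = M @ np.array([x, y, 1.0])   # homogeneous pixel coordinates

# Convert to Cartesian pixel coordinates
u = p_tilde[0] / p_tilde[2]
v = p_tilde[1] / p_tilde[2]
```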
Essentially, we take the first and second element of the homogeneous vector and divide it by the third element of the homogeneous vector. Now, we can put all these pieces together and we can write the complete camera model in terms of three matrices. The product of the first two matrices is typically denoted by the symbol K and we refer to these as the intrinsic parameters. All the numbers in these two matrices are functions of the camera itself. It doesn’t matter where the camera is, or where it’s pointing, they’re only a function of the camera. These numbers include the height and width of the pixels on the image plane, the coordinates of the principal point, and the focal length of the lens.
The third matrix describes the extrinsic parameters, and these describe where the camera is, but they don’t say anything about the type of camera. The elements of this matrix are a function of the relative pose of the camera with respect to the world origin frame; in fact, it is the inverse of the camera pose ξC. The product of all of these matrices together is referred to as the camera matrix, and it’s often given the symbol C. So this single three-by-four matrix is all we need to describe the mapping from a world coordinate, X, Y and Z, through to a homogeneous representation of the pixel coordinate on the image plane.
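Putting the pieces together, a sketch of the complete camera matrix: intrinsics K, the 3 × 4 dimensionality reduction, and the inverse camera pose as extrinsics. The intrinsic values and the camera pose (identity rotation, camera displaced 2 m along its own z axis from the world origin) are assumptions for illustration:

```python
import numpy as np

# Intrinsic parameters (assumed example values)
f = 0.008               # focal length in metres
rho = 10e-6             # pixel size in metres
u0, v0 = 640.0, 512.0   # principal point in pixels
K = np.array([
    [f / rho, 0.0,     u0],
    [0.0,     f / rho, v0],
    [0.0,     0.0,     1.0],
])

# Extrinsic parameters: the inverse of the camera pose (assumed pose)
R = np.eye(3)                   # camera orientation relative to the world
t = np.array([0.0, 0.0, 2.0])   # world origin expressed in the camera frame
T_inv = np.vstack([np.hstack([R, t[:, None]]), [0.0, 0.0, 0.0, 1.0]])

# 3x4 dimensionality reduction
proj = np.hstack([np.eye(3), np.zeros((3, 1))])

# The camera matrix: one 3x4 matrix mapping world points to pixels
C = K @ proj @ T_inv

# Project an example world point (assumed) to pixel coordinates
Pw = np.array([0.5, 0.25, 3.0, 1.0])
p_tilde = C @ Pw
u, v = p_tilde[:2] / p_tilde[2]
```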
That homogeneous image-plane coordinate can be converted to the familiar Cartesian image-plane coordinate using this transformation rule here. So this is a very simple and concise way of performing perspective projection. Let’s consider now what happens when I introduce a non-zero scale factor lambda. The homogeneous coordinate elements u tilde, v tilde, and w tilde will all be scaled by lambda. When I convert them to Cartesian form, the lambda term will be factored out of the numerator and the denominator, so the result will be unchanged. This is a particular advantage of writing the relationship in homogeneous form: it gives us what’s called scale invariance.
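The scale invariance is easy to demonstrate: scaling the camera matrix by any non-zero lambda leaves the Cartesian pixel coordinates unchanged, because lambda cancels in the division. The camera matrix and world point below are assumed example values:

```python
import numpy as np

# An example 3x4 camera matrix and world point (both assumed)
C = np.array([
    [800.0, 0.0,   640.0, 0.0],
    [0.0,   800.0, 512.0, 0.0],
    [0.0,   0.0,   1.0,   0.0],
])
Pw = np.array([0.5, 0.25, 5.0, 1.0])

lam = 3.7                  # arbitrary non-zero scale factor
p1 = C @ Pw                # original homogeneous pixel coordinates
p2 = (lam * C) @ Pw        # every homogeneous element is scaled by lambda

uv1 = p1[:2] / p1[2]       # lambda cancels out in the division,
uv2 = p2[:2] / p2[2]       # so the Cartesian result is identical
```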
Because we can multiply the camera matrix by an arbitrary scale factor, it means we can write the camera matrix in a slightly simplified form, which we refer to as a normalised camera matrix. We do that by choosing one particular element of that matrix to have a value of one and typically we choose the bottom-right element and set it to one. This normalised camera matrix still contains all of the information to completely describe the image formation process. It contains the focal length of the lens, it contains the dimensions of the pixels, it contains the coordinate of the principal point, and it contains the position and orientation of the camera in three-dimensional space.
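Normalisation is then a single division: scale the whole matrix so the bottom-right element becomes one. The example matrix below, with a non-unit bottom-right element, is an assumption for illustration:

```python
import numpy as np

# An example camera matrix whose bottom-right element is not 1 (assumed)
C = np.array([
    [1600.0, 0.0,    1280.0, 100.0],
    [0.0,    1600.0, 1024.0, 200.0],
    [0.0,    0.0,    2.0,    2.0],
])

# Normalised camera matrix: divide through by the bottom-right element
C_norm = C / C[2, 3]
```

Because of scale invariance, C and C_norm project every world point to the same pixel coordinates.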
And finally, we can convert the homogeneous pixel coordinates to the more familiar Cartesian pixel coordinates, which we denote by u and v.

This video explains how to use a central projection model to project a point in the world through the image plane to the origin of a camera.

Here we will describe the relationship between a 3D world point and a 2D image-plane point, both expressed in homogeneous coordinates using a linear transformation – a 3 x 4 matrix.

This article is from the free online course Robotic Vision: Making Robots See, on FutureLearn.