Welcome to week four. In this week, we’ll look at covariance functions, what covariance functions are, what types of covariance functions exist, and how we can use them for shape modelling.
By now, you all know how to define a Gaussian Process. We define a Gaussian Process by specifying a mean function and a covariance function. The covariance function is also often referred to as a kernel function. The mean function defines what the average deformation looks like. The covariance function, in turn, defines how a typical deformation deviates from this mean. We have already seen an example of how we can define a mean and a covariance function: we took example data sets and then estimated the mean and covariance function from the example data. What we would like to discuss now is how we can specify a Gaussian Process when we don’t have data sets available.
If we don’t have any data, then the mean function is usually quite uninteresting. We would need to specify how a typical deformation deviates from the mean, but if we don’t have any prior knowledge about that, it’s usually difficult to specify. So, if we can assume that the reference shape that we’re choosing is roughly average, then we just say that the average deformation is the zero deformation.
The covariance function is much more interesting. It defines the characteristics of the deformation fields. While in general it is very difficult to find characteristics that are common to different shapes, there is at least one assumption that holds for pretty much all the shapes that we want to model. This assumption is that deformation fields are smooth. What this means is that neighbouring points should have similar deformations, so there is no folding or overlapping of different deformations.
On top of that, there is also a mathematical requirement that a covariance function has to fulfil: the covariance function should be a symmetric, positive semi-definite kernel. Symmetric means that k of x, x prime, is the same as k of x prime, x, so I can just interchange the two arguments. What does it mean to be positive semi-definite? You might be familiar with the concept of a positive semi-definite matrix.
A positive semi-definite matrix is just a matrix for which the following holds: if I take an arbitrary vector v and multiply the matrix from the left by v transposed and from the right by v, so that I get this quadratic form, then independently of which vector v I choose, the result should always be greater than or equal to zero. This is the property of positive semi-definiteness of matrices.
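The quadratic form described above can be tried out numerically. The following is a minimal sketch in Python with NumPy; the helper name `quadratic_form` and the example matrices are made up for illustration. It evaluates v transposed times A times v for many random vectors, and also shows a symmetric matrix that fails the test.

```python
import numpy as np

def quadratic_form(A, v):
    """Return the scalar v^T A v for a square matrix A and a vector v."""
    return float(v @ A @ v)

rng = np.random.default_rng(0)

# Any matrix of the form B^T B is symmetric and positive semi-definite,
# because v^T (B^T B) v = ||B v||^2 >= 0.
B = rng.normal(size=(3, 3))
A = B.T @ B

# Try many arbitrary vectors: the quadratic form should never be negative
# (up to floating-point round-off).
values = [quadratic_form(A, rng.normal(size=3)) for _ in range(1000)]
print("smallest value for A:", min(values))

# A symmetric matrix that is NOT positive semi-definite:
# for v = (1, -1) the quadratic form is 1 - 2 - 2 + 1 = -2.
C = np.array([[1.0, 2.0],
              [2.0, 1.0]])
v = np.array([1.0, -1.0])
print("quadratic form for C:", quadratic_form(C, v))
```

Testing random vectors can only reveal a violation; it cannot prove positive semi-definiteness. For a symmetric matrix, checking that all eigenvalues are non-negative is the more reliable test.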
Now, a positive semi-definite kernel is very closely related.
We say a kernel is positive semi-definite if the following holds. When we discretise the kernel, that is, when we evaluate it on a finite number of points as we always did when going from the continuous to the discrete representation, I choose a number of points x, evaluate my kernel function and build the covariance matrix out of it. This matrix has to be positive semi-definite, independently of which points I choose. This is actually a very reasonable requirement, because it just ensures that the normal distribution that I get when I discretise is actually a valid normal distribution.
Because, if you remember, a valid covariance matrix for a multivariate normal distribution needs to be symmetric and positive semi-definite.
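The discretisation step can be sketched in a few lines of Python with NumPy. All names here (`covariance_matrix`, `is_symmetric_psd`, and the two example kernels) are illustrative assumptions and not part of any course library. The linear kernel is a standard example of a positive semi-definite kernel; the negative distance is symmetric but not positive semi-definite, so its discretisation would not give a valid normal distribution.

```python
import numpy as np

def covariance_matrix(kernel, points):
    """Evaluate a scalar valued kernel on all pairs of the given points."""
    n = len(points)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(points[i], points[j])
    return K

def is_symmetric_psd(K, tol=1e-8):
    """Check symmetry and non-negative eigenvalues (up to a tolerance)."""
    if not np.allclose(K, K.T):
        return False
    return bool(np.linalg.eigvalsh(K).min() >= -tol)

def linear_kernel(x, x_prime):
    # k(x, x') = <x, x'> is a positive semi-definite kernel.
    return float(np.dot(x, x_prime))

def negative_distance(x, x_prime):
    # Symmetric, but NOT positive semi-definite.
    return -float(np.linalg.norm(x - x_prime))

rng = np.random.default_rng(1)
points = rng.uniform(-5.0, 5.0, size=(30, 2))  # 30 arbitrary points in 2D

K_good = covariance_matrix(linear_kernel, points)
K_bad = covariance_matrix(negative_distance, points)
print("linear kernel gives a valid covariance matrix:", is_symmetric_psd(K_good))
print("negative distance gives a valid covariance matrix:", is_symmetric_psd(K_bad))
```

A check on one set of points can only expose a kernel that is not positive semi-definite. The definition requires the property for every finite set of points, which has to be shown mathematically.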
Let us look at an example of a covariance function. The most classical kernel or covariance function, which is used everywhere in the literature, is called the Gaussian kernel. The Gaussian kernel is what is called a scalar valued kernel. For us, when we’re modelling vector fields, what we need is a matrix valued kernel. Because remember, the kernel or covariance function tells us how the two function values at the points x and x prime are correlated. If our function values are vectors, then we have two components, and to describe the correlation, we need a 2 times 2 matrix.
But if we only had real valued functions that we want to model, then it would be enough to have just one scalar value for describing this correlation. One example of such a scalar valued kernel is the Gaussian kernel. The Gaussian kernel is defined by this function here. The main part is just the formula e to the power of minus the squared distance between the two input points, scaled by this parameter sigma. What it means is that the closer the points are together, the higher the function value is, just as shown here in this plot. So, it’s this typical bell curve. We observe that if the points are far apart, then it goes to zero.
This means that points that are far apart don’t correlate anymore. The shape changes with the parameter sigma: if I increase sigma, then I get correlations over a wider range of points. The parameter s tells us something about how much variance I will be modelling. It’s the amplitude of this bell curve. So, we see here that if we increase s, then we have a higher amplitude of that curve, which translates to a higher variance in the process at the end.
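To make the roles of s and sigma concrete, here is a small Python sketch. The exact form of the formula on the slide is not reproduced in this transcript, so the code assumes the common parametrisation k(x, x') = s · exp(−‖x − x'‖² / σ²); other texts divide by 2σ² instead.

```python
import numpy as np

def gaussian_kernel(x, x_prime, s=1.0, sigma=1.0):
    """Scalar valued Gaussian kernel (assumed form: s * exp(-||x - x'||^2 / sigma^2))."""
    diff = np.atleast_1d(x).astype(float) - np.atleast_1d(x_prime).astype(float)
    return float(s * np.exp(-np.dot(diff, diff) / sigma**2))

x = np.array([0.0, 0.0])
for distance in [0.0, 1.0, 3.0, 10.0]:
    x_prime = np.array([distance, 0.0])
    narrow = gaussian_kernel(x, x_prime, s=1.0, sigma=1.0)  # short-range correlation
    wide = gaussian_kernel(x, x_prime, s=1.0, sigma=5.0)    # larger sigma: wider bell
    tall = gaussian_kernel(x, x_prime, s=4.0, sigma=1.0)    # larger s: higher bell
    print(f"distance {distance:4.1f}: sigma=1 -> {narrow:.4f}, "
          f"sigma=5 -> {wide:.4f}, s=4 -> {tall:.4f}")
```

The value at distance zero equals s, which is the variance of the process at a single point. As the distance grows, the value decays to zero, and it does so more slowly for larger sigma.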
Now, we’ve started with a scalar valued covariance function. From the scalar valued covariance function, we can build the matrix valued covariance functions we need for modelling the deformation fields. The simplest such construction is to use a diagonal kernel. Assume we’re given d different scalar valued kernels, where d is just the dimensionality of the deformation field we want to model. In our case, it would just be two, as we are working in 2D. So, this thing here would be a 2 times 2 matrix, and I would just have two scalar valued kernels on the diagonal. All the other entries would be zero.
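The diagonal construction can be sketched as follows in Python. The function names `diagonal_kernel` and `block_covariance_matrix` are hypothetical, and the Gaussian kernel uses the assumed form s · exp(−‖x − x'‖² / σ²). Each evaluation of the matrix valued kernel returns a 2 times 2 matrix, and discretising on n points gives a 2n times 2n covariance matrix.

```python
import numpy as np

def gaussian_kernel(x, x_prime, s=1.0, sigma=1.0):
    """Scalar valued Gaussian kernel (assumed form: s * exp(-||x - x'||^2 / sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return float(s * np.exp(-np.dot(diff, diff) / sigma**2))

def diagonal_kernel(scalar_kernels):
    """Build a matrix valued kernel with the given scalar kernels on the diagonal."""
    def k(x, x_prime):
        return np.diag([k_i(x, x_prime) for k_i in scalar_kernels])
    return k

def block_covariance_matrix(kernel, points, d=2):
    """Discretise a d x d matrix valued kernel on n points into an (n*d) x (n*d) matrix."""
    n = len(points)
    K = np.zeros((n * d, n * d))
    for i in range(n):
        for j in range(n):
            K[i * d:(i + 1) * d, j * d:(j + 1) * d] = kernel(points[i], points[j])
    return K

# One scalar kernel per component of the 2D deformation vectors.
k1 = lambda x, y: gaussian_kernel(x, y, s=1.0, sigma=2.0)  # x component
k2 = lambda x, y: gaussian_kernel(x, y, s=0.5, sigma=1.0)  # y component
k = diagonal_kernel([k1, k2])

block = k(np.array([0.0, 0.0]), np.array([1.0, 0.0]))
print(block)  # off-diagonal entries are zero: the components are uncorrelated

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
K = block_covariance_matrix(k, points)
print(K.shape)
```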
What the zeros mean is that I’m actually modelling each dimension independently, so there is no correlation between the x and the y component of a vector. Now, our original goal was to model smooth deformation fields. We are now in a position where we can do that. We take a Gaussian kernel and use this diagonal approach to construct the matrix valued covariance function out of the scalar valued Gaussian kernel, simply by putting the different Gaussian kernels on the diagonal. Now, let’s look at some examples of what this looks like. What I have shown here are different samples from a Gaussian Process that was constructed in that way.
The first thing we observe is that, while before I always showed deformations only on the contour of the hand, here I show them in the region around the hand. This is because this type of covariance function is defined on the whole real space. So, I can also sample from it everywhere, if I want to. In the first example that we see, we have chosen both the scale factors s1 and s2 and the parameters sigma 1 and sigma 2 to be large for both components. We see that we have nice, smooth deformations. So, all the vectors there are moving smoothly in this image.
Now, if we choose s1 and s2 small, then what we’ll see is that all of the vectors in that image are actually small, so we don’t model much variation and there are no big deformations.
We could also choose only s1 small. Then the first component has almost no variance while the second component has a lot of variance. We see two samples of such a vector field here. What we observe is that we now have a dominant direction. Most arrows now point in the y direction. In the second sample here, we also see this dominant direction.
If we choose s1 and s2 larger again, but sigma, the smoothness parameter, smaller, then we get deformation fields that are very wild and wiggly.
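The effect of these parameter choices can be reproduced with a small sampling experiment. This is a sketch under stated assumptions, not the code behind the figures: the grid, the parameter values, and the function names are invented for illustration, and the Gaussian kernel uses the assumed form s · exp(−‖x − x'‖² / σ²). Because the kernel is diagonal, each component of the vector field can be sampled independently.

```python
import numpy as np

def gaussian_gram(points, s, sigma):
    """Covariance matrix of the (assumed) Gaussian kernel on all pairs of points."""
    diff = points[:, None, :] - points[None, :, :]
    return s * np.exp(-np.sum(diff**2, axis=-1) / sigma**2)

def sample_component(points, s, sigma, rng):
    """Draw one zero-mean sample of a scalar field at the given points."""
    K = gaussian_gram(points, s, sigma)
    eigenvalues, eigenvectors = np.linalg.eigh(K)
    eigenvalues = np.clip(eigenvalues, 0.0, None)  # remove round-off negatives
    z = rng.standard_normal(len(points))
    return eigenvectors @ (np.sqrt(eigenvalues) * z)

def sample_deformation_field(points, s, sigma, rng):
    """Diagonal kernel: sample the x and y components independently."""
    return np.column_stack(
        [sample_component(points, s[d], sigma[d], rng) for d in range(2)]
    )

def roughness(field, n):
    """Mean absolute difference between horizontally neighbouring grid vectors."""
    grid = field.reshape(n, n, 2)
    return float(np.abs(np.diff(grid, axis=1)).mean())

# A regular n x n grid of points in the region [0, 100] x [0, 100].
n = 12
xs = np.linspace(0.0, 100.0, n)
X, Y = np.meshgrid(xs, xs)
points = np.column_stack([X.ravel(), Y.ravel()])
rng = np.random.default_rng(42)

fields = {
    "s large, sigma large": sample_deformation_field(points, (100.0, 100.0), (50.0, 50.0), rng),
    "s small, sigma large": sample_deformation_field(points, (1.0, 1.0), (50.0, 50.0), rng),
    "s1 small, s2 large": sample_deformation_field(points, (0.01, 100.0), (50.0, 50.0), rng),
    "s large, sigma small": sample_deformation_field(points, (100.0, 100.0), (5.0, 5.0), rng),
}

for name, field in fields.items():
    mean_abs_x, mean_abs_y = np.abs(field).mean(axis=0)
    print(f"{name:22s} mean |x| = {mean_abs_x:7.3f}, mean |y| = {mean_abs_y:7.3f}, "
          f"roughness = {roughness(field, n):7.3f}")
```

The printed statistics stand in for the plots: small s gives small vectors, a small s1 leaves mostly y displacement, and a small sigma makes neighbouring vectors differ strongly. The fields could also be drawn as arrows, for example with a quiver plot.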
Now, the nice thing about modelling with Gaussian Processes is that we can actually do much more than simply defining diagonal kernels. We can really model with these different kernels. There are rules that positive semi-definite kernels need to obey so that they remain positive semi-definite, but there is an entire algebra that I can use to build new kernels based on existing kernels. One such example is that I can take the sum of two positive semi-definite kernels, and it remains a positive semi-definite kernel. The interpretation of that is that two function values are correlated if they’re correlated in the first kernel g or in the second kernel h. So I can express this kind of ‘or’ relationship.
The sum of two kernels is actually ideal for defining deformations that live on multiple scales. For example, I may want to model tiny, small-scale, high-frequency deformations together with large deformations, which are very smooth. This type of model is useful for shape modelling. I have a second possibility: I can also multiply two kernels together. This is not an ‘or’ relationship anymore. This is now an ‘and’ relationship. The function values that I obtain in the new kernel are only correlated if they’re correlated in the kernel g and in the kernel h. This is especially good for localising correlation.
So, I could just choose one of the two kernels to be local, such that the correlation value is zero on a large area. This would effectively annihilate all the correlations that I have there. So, I can model local effects here. There are many other rules that I can use for combining kernels. We can really think of Gaussian Processes as giving us a language for modelling. We will explore such interesting models that come from combining different kernels in the following article.
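The two combination rules can be checked numerically. The sketch below is in Python with NumPy; the helper names `sum_kernel` and `product_kernel` are hypothetical, and the Gaussian kernel uses the assumed form s · exp(−‖x − x'‖² / σ²). It combines a coarse and a fine Gaussian kernel, both with s = 1 so that their values are easy to compare, and verifies that the discretised sum and product are still positive semi-definite.

```python
import numpy as np

def gaussian_kernel(x, x_prime, s=1.0, sigma=1.0):
    """Scalar valued Gaussian kernel (assumed form: s * exp(-||x - x'||^2 / sigma^2))."""
    diff = np.atleast_1d(x).astype(float) - np.atleast_1d(x_prime).astype(float)
    return float(s * np.exp(-np.dot(diff, diff) / sigma**2))

def sum_kernel(g, h):
    """'Or' combination: correlated if correlated under g or under h."""
    return lambda x, x_prime: g(x, x_prime) + h(x, x_prime)

def product_kernel(g, h):
    """'And' combination: correlated only if correlated under both g and h."""
    return lambda x, x_prime: g(x, x_prime) * h(x, x_prime)

def covariance_matrix(kernel, points):
    """Evaluate a scalar valued kernel on all pairs of the given points."""
    return np.array([[kernel(p, q) for q in points] for p in points])

coarse = lambda x, y: gaussian_kernel(x, y, s=1.0, sigma=50.0)  # large, smooth scale
fine = lambda x, y: gaussian_kernel(x, y, s=1.0, sigma=5.0)     # small, detailed scale
multi_scale = sum_kernel(coarse, fine)
both = product_kernel(coarse, fine)

points = np.linspace(0.0, 100.0, 25).reshape(-1, 1)  # 25 points on a line
K_coarse = covariance_matrix(coarse, points)
K_fine = covariance_matrix(fine, points)
K_sum = covariance_matrix(multi_scale, points)
K_prod = covariance_matrix(both, points)

# Both combinations remain positive semi-definite (up to round-off).
print("min eigenvalue of sum:    ", np.linalg.eigvalsh(K_sum).min())
print("min eigenvalue of product:", np.linalg.eigvalsh(K_prod).min())

# 'Or': the sum is at least as large as either kernel.
print(bool(np.all(K_sum >= np.maximum(K_coarse, K_fine))))
# 'And': with values in [0, 1], the product is at most as large as either kernel.
print(bool(np.all(K_prod <= np.minimum(K_coarse, K_fine))))
```

In a multi-scale model one would typically also give the two scales different variances s. Equal values are used here only to keep the comparison simple.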
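As a sketch of the localisation idea, one possible local kernel is h(x, x') = w(x) · w(x'), where w is a weight function that is close to one inside a region of interest and close to zero elsewhere. This particular choice, the names `local_weight` and `localised_kernel`, and all parameter values are assumptions made for illustration; the lecture does not say which local kernel is used. A kernel of this product form is positive semi-definite, so multiplying it with a Gaussian kernel gives a valid covariance function again.

```python
import numpy as np

def gaussian_kernel(x, x_prime, s=1.0, sigma=1.0):
    """Scalar valued Gaussian kernel (assumed form: s * exp(-||x - x'||^2 / sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return float(s * np.exp(-np.dot(diff, diff) / sigma**2))

def local_weight(x, centre, radius):
    """Hypothetical weight: about 1 near the centre, about 0 far away from it."""
    diff = np.asarray(x, dtype=float) - centre
    return float(np.exp(-np.dot(diff, diff) / radius**2))

def localised_kernel(g, centre, radius):
    """Multiply g with the local kernel h(x, x') = w(x) * w(x')."""
    def k(x, x_prime):
        h = local_weight(x, centre, radius) * local_weight(x_prime, centre, radius)
        return g(x, x_prime) * h
    return k

g = lambda x, y: gaussian_kernel(x, y, s=1.0, sigma=30.0)  # global smooth kernel
centre = np.array([20.0, 20.0])
k = localised_kernel(g, centre, radius=10.0)

near_a, near_b = np.array([20.0, 20.0]), np.array([22.0, 20.0])  # inside the region
far_a, far_b = np.array([80.0, 80.0]), np.array([82.0, 80.0])    # far from the region

print("near the centre: g =", g(near_a, near_b), " localised =", k(near_a, near_b))
print("far away:        g =", g(far_a, far_b), " localised =", k(far_a, far_b))

# The localised kernel still yields a positive semi-definite covariance matrix.
rng = np.random.default_rng(3)
points = rng.uniform(0.0, 100.0, size=(40, 2))
K = np.array([[k(p, q) for q in points] for p in points])
print("min eigenvalue:", np.linalg.eigvalsh(K).min())
```

Two nearby points far from the centre are strongly correlated under g, but their covariance under the localised kernel is practically zero. A sample from the resulting process therefore deforms only the neighbourhood of the centre.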