Advanced Machine Learning: Gaussian Processes
Aside: Kriging
Gaussian processes have received a lot of attention from the machine learning community over the last decade. However, they were originally developed in the 1950s, in a master's thesis by Danie Krige, who worked on modeling gold deposits in the Witwatersrand reef complex in South Africa. The mathematics was formalized by the French mathematician Georges Matheron in the early 1960s. The method developed was known as kriging, and this term remains common in the geostatistics literature.

Gaussian processes are simple to understand, but require one important change of perspective: we consider the values of the target variable to be random variables whose joint probability distribution is a multivariate Gaussian. The mean of this multivariate Gaussian is typically 0 (if this is unreasonable, we can center the target variables so that their sample mean is 0), and the covariance matrix, \(\Sigma\), is a function of the input features such that \(\Sigma_{i,j}=k(X_i,X_j)\), where \(k\) is some kernel function.

Consider the following training data:
\(X\) | \(Y\) |
---|---|
4.5 | 2.1 |
1.6 | 4.3 |
3.5 | 1.7 |
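As a concrete illustration, here is a minimal sketch of building the covariance matrix \(\Sigma\) for these three training points. The choice of a squared-exponential (RBF) kernel with lengthscale 1 is an assumption for illustration; the article has not fixed a particular kernel:

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential (RBF) kernel: k(a, b) = exp(-(a - b)^2 / (2 l^2))."""
    return np.exp(-(a - b) ** 2 / (2.0 * lengthscale ** 2))

X = np.array([4.5, 1.6, 3.5])   # training inputs
Y = np.array([2.1, 4.3, 1.7])   # training targets (the text suggests centering these)

# Sigma[i, j] = k(X_i, X_j), built pairwise over the training inputs.
Sigma = rbf_kernel(X[:, None], X[None, :])
print(Sigma.round(3))
```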
Aside: The interested student can see the process for calculating conditional multivariate Gaussian distributions in the *Calculating conditional multivariate Gaussian distributions* document, available in the downloads section at the end of this article.
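For readers who want the punchline without the derivation: conditioning the joint Gaussian over (training \(Y\) values, new \(Y\) value) on the observed training values gives a Gaussian predictive distribution at a new input. A minimal sketch, continuing the code above:

```python
# Conditioning the joint Gaussian on the observed training Ys gives, at a new
# input x_star:
#   mean     = k_*^T Sigma^{-1} y
#   variance = k(x_star, x_star) - k_*^T Sigma^{-1} k_*
x_star = 2.5
y = Y - Y.mean()                        # center targets so the zero-mean assumption holds
k_star = rbf_kernel(X, x_star)          # covariances between x_star and the training inputs
mean = Y.mean() + k_star @ np.linalg.solve(Sigma, y)
var = rbf_kernel(x_star, x_star) - k_star @ np.linalg.solve(Sigma, k_star)
print(mean, var)
```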



Aside: Standard Deviations and Probability Density
For a Gaussian distribution, we know the proportion of the probability density that lies within an interval of a given number of standard deviations centered at the mean. This gives a simple way of generating confidence intervals (though using the probability density directly is not significantly more difficult). Commonly used intervals are:

\(\mu \pm n \sigma\) | \(\approx\) density proportion | Used as confidence level |
---|---|---|
\(n=1\) | 0.6827 | 0.68 |
\(n=2\) | 0.9545 | 0.95 |
\(n=3\) | 0.9973 | 0.995 |
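Applied to the predictive mean and variance computed in the earlier sketch, an approximate 95% interval is then:

```python
std = np.sqrt(var)
lower, upper = mean - 2 * std, mean + 2 * std   # ~95% confidence interval (n = 2)
print(lower, upper)
```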
Variance at the training points
We noted that the variance at the training points drops to 0. This means that if we see an \(X\) value that appeared in our training data, we are certain that the corresponding \(Y\) value will match that of the training example exactly. Except for the rare case of modeling noise-free mathematical functions, this is unreasonable.

We can avoid this by introducing a hyper-parameter, \(\lambda\), which is added to the diagonal of the kernel matrix when calculating the covariance matrix, \(\Sigma\). This leads to:

\[\Sigma = \begin{bmatrix} k(X_1,X_1) + \lambda & k(X_1,X_2) & \ldots & k(X_1,X_n) \\ k(X_2,X_1) & k(X_2,X_2) + \lambda & & \vdots \\ \vdots & & \ddots & \\ k(X_n,X_1) & \ldots & & k(X_n,X_n) + \lambda \end{bmatrix}\]

Let's view models with \(\lambda \geq 0\):

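In code, and continuing the earlier sketch, adding \(\lambda\) to the diagonal is a one-line change (the value 0.1 here is an arbitrary assumption for illustration):

```python
lam = 0.1                                    # assumed noise hyper-parameter
Sigma_noisy = Sigma + lam * np.eye(len(X))   # add lambda to the diagonal

# With lam > 0, the predictive variance at a training input no longer drops to 0.
```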
Reversion to Mean
An interesting characteristic of Gaussian processes is that outside the training data they will revert to the process mean. The speed of this reversion is governed by the kernel used. This contrasts with many non-linear models, which can exhibit 'wild' behaviour outside the training data, shooting off to implausibly large values. An example is given with polynomial regression below (ignore the fact that the polynomial regression model is wildly overfitted; wild behaviour outside the training data can happen even in models that are not overfitted):
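Separately from the polynomial example, the reversion itself is easy to check numerically. Continuing the running sketch, we evaluate the predictive mean well outside the range of the training inputs: the covariances in \(k_*\) decay towards 0, so the prediction falls back towards the process mean (which, after de-centering, is \(\bar{Y}\)):

```python
alpha = np.linalg.solve(Sigma_noisy, Y - Y.mean())
for x_star in [5.0, 7.0, 10.0]:
    pred = Y.mean() + rbf_kernel(X, x_star) @ alpha
    print(x_star, pred)   # tends towards Y.mean() as x_star moves away from the data
```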
Training
When working with Gaussian processes, you will need to try a variety of kernels and values for the \(\lambda\) hyper-parameter. Additionally, the kernels themselves will typically have a number of parameters that need to be specified and which function as hyper-parameters for the Gaussian process training algorithm.

Typically you would decide on values for these hyper-parameters using a validation method. To make sure you select a good kernel function, you should seek to optimize the likelihood of the validation data rather than, for example, the mean squared error between the validation data and the expected-value regression curve.
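A minimal sketch of such a search, reusing the rbf_kernel helper from the first example. The hyper-parameter grids, the train/validation split, and the synthetic data are all assumptions for illustration; each candidate is scored by the log-density of the validation targets under the GP predictive distribution:

```python
from scipy.stats import norm

# Synthetic labelled data, purely for illustration, split into train/validation.
rng = np.random.default_rng(0)
X_all = rng.uniform(0.0, 6.0, size=100)
Y_all = np.sin(X_all) + 3.0 + rng.normal(scale=0.1, size=100)
X_tr, Y_tr, X_va, Y_va = X_all[:80], Y_all[:80], X_all[80:], Y_all[80:]

best_params, best_ll = None, -np.inf
for ls in [0.3, 1.0, 3.0]:                      # candidate kernel lengthscales
    for lam in [1e-3, 1e-2, 1e-1]:              # candidate noise levels
        K = rbf_kernel(X_tr[:, None], X_tr[None, :], ls) + lam * np.eye(len(X_tr))
        y = Y_tr - Y_tr.mean()
        alpha = np.linalg.solve(K, y)
        ll = 0.0
        for x, t in zip(X_va, Y_va):
            k_star = rbf_kernel(X_tr, x, ls)
            mu = Y_tr.mean() + k_star @ alpha
            # Predictive variance of a noisy observation includes lam itself.
            var = rbf_kernel(x, x, ls) + lam - k_star @ np.linalg.solve(K, k_star)
            ll += norm.logpdf(t, loc=mu, scale=np.sqrt(var))
        if ll > best_ll:
            best_params, best_ll = (ls, lam), ll
print("best (lengthscale, lambda):", best_params)
```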
Online Learning

Online learning is when the model updates itself in response to incoming data. It is simple to perform online learning with Gaussian processes: as new labelled data cases become available, the process covariance matrix is updated with a row and column for each new case.
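A sketch of this update, continuing the running example (the incoming case is made up for illustration):

```python
def add_point(Sigma, X, Y, x_new, y_new, lam=0.1):
    """Grow the covariance matrix by one row and column for an incoming case."""
    k_new = rbf_kernel(X, x_new)                   # covariances with existing points
    top = np.hstack([Sigma, k_new[:, None]])
    bottom = np.hstack([k_new, rbf_kernel(x_new, x_new) + lam])
    return np.vstack([top, bottom[None, :]]), np.append(X, x_new), np.append(Y, y_new)

# Hypothetical incoming case, for illustration.
Sigma_noisy, X, Y = add_point(Sigma_noisy, X, Y, x_new=2.0, y_new=3.8)
```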
Distribution over Functions

Consider that we can obtain conditional distributions for an arbitrary number of new \(X\) values at once. Conceptually, we can even consider the case where we work with uncountably many new \(X\) values, allowing us to obtain a conditional distribution for every point in the input space, given the training data. This conditional distribution is a distribution over functions on the input space.

Below are two examples where we sample a function from a Gaussian process generated from the following training data:

\(X\) | \(Y\) |
---|---|
4.5 | 2.1 |
1.6 | 4.3 |
3.5 | 1.7 |
1.4 | 4.4 |
1.1 | 4.3 |
0.7 | 5.1 |
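In practice, 'sampling a function' means choosing a fine grid of new \(X\) values and drawing one sample from the joint conditional Gaussian over that grid. A sketch reusing the rbf_kernel helper from the first example (the grid and noise level are assumptions):

```python
X_tr = np.array([4.5, 1.6, 3.5, 1.4, 1.1, 0.7])
Y_tr = np.array([2.1, 4.3, 1.7, 4.4, 4.3, 5.1])
lam = 0.05                                        # assumed noise level

grid = np.linspace(0.0, 6.0, 200)                 # finite stand-in for "every point"
K = rbf_kernel(X_tr[:, None], X_tr[None, :]) + lam * np.eye(len(X_tr))
K_s = rbf_kernel(X_tr[:, None], grid[None, :])    # train-vs-grid covariances
K_ss = rbf_kernel(grid[:, None], grid[None, :])   # grid-vs-grid covariances

y = Y_tr - Y_tr.mean()
mu = Y_tr.mean() + K_s.T @ np.linalg.solve(K, y)  # conditional mean over the grid
cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)      # conditional covariance over the grid
cov += 1e-8 * np.eye(len(grid))                   # jitter for numerical stability

# One draw from this multivariate Gaussian is one sampled function on the grid.
f_sample = np.random.default_rng(0).multivariate_normal(mu, cov)
```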


Scalability
Gaussian processes scale to large data even worse than most other kernel methods. With kernel methods we often have to calculate the kernel matrix, at a cost of \(O(N^2)\) for a data set with \(N\) rows. But for Gaussian processes, we also have to calculate the inverse of this matrix, leading to a complexity of \(O(N^3)\).

There are a number of ways of reducing this complexity by approximating either the covariance matrix or its inverse: for example, approximating the inverse using numerical methods, or approximating the covariance matrix with a banded matrix and then approximating the inverse of that banded matrix by another banded matrix. We do not know how popular these approximation methods are in practice, but we think it fair to say that Gaussian processes are typically used with reasonably small data sets.
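As an implementation aside, the explicit inverse is usually avoided in favour of a Cholesky factorization, which is still \(O(N^3)\) but more numerically stable, and the factor can be reused for every prediction. A brief sketch, continuing the previous example:

```python
from scipy.linalg import cho_factor, cho_solve

K = rbf_kernel(X_tr[:, None], X_tr[None, :]) + lam * np.eye(len(X_tr))
chol = cho_factor(K)                              # O(N^3) factorization, done once
alpha = cho_solve(chol, Y_tr - Y_tr.mean())       # reused cheaply for every prediction
```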