# Advanced Machine Learning: Gaussian Processes

#### Aside: Kriging

Gaussian processes have received a lot of attention from the machine learning community over the last decade. However they were originally developed in the 1950s in a master thesis by Danie Krig, who worked on modeling gold deposits in the Witwatersrand reef complex in South Africa. The mathematics was formalized by the French mathematician Georges Matheron in 1960. The method developed was known as*kriging*, and this term remains common in geo-statistics literature.

Gaussian processes are simple to understand, but require one important change of perspective. The change of perspective is this: We consider the values of the target variable to be random variables whose multivariate probability distribution is a multivariate Gaussian. The mean of this multivariate Gaussian is typically 0 (if this is unreasonable, we can center the target variables so that their sample mean is 0), and the covariance matrix, \(\Sigma\) is a function of the input features such that \(\Sigma_{i,j}=k(X_i,X_j)\), where \(k\) is some kernel function. Consider the following training data:

\(X\) | \(Y\) |
---|---|

4.5 | 2.1 |

1.6 | 4.3 |

3.5 | 1.7 |

Aside: The interested student can see the process for calculating conditional multivariate Gaussian distributions in the

*Calculating conditional multivariate Gaussian distributions*document available in the downloads section at the end of this article.

We examine this model graphically. We see that it produces a regression curve, with confidence intervals around it. The confidence intervals here are chosen to be three times the standard deviation at each point, which equates to approximately 99.5% confidence (we estimate that the true values of Y at each X point lies within this interval with .995 probability). Note the confidence intervals vanish to 0 at the data points (something which we will return to below), and expand as we get further away from training examples. For our new point, we obtain both the expected value, and the confidence intervals around it. These confidence intervals are calculated from the variance (i.e. the squared standard deviation) estimate of \(\hat{Y}_*\), using the process explained above. Of course, what we really receive is a probability distribution for the value of \(Y\) at \(X=3.7\). The confidence intervals are simply a means of displaying an interval of sufficient confidence for the problem we are modeling: 99.5% of the density of the (vertically) displayed probability distribution lies between the confidence intervals. If you are wondering how we get the regression and confidence interval curves, like any function curves, these are simply generated for each X value (in reality a sequence of X values and then joined).

#### Aside: Standard Deviations and Probability Density

For a Gaussian distribution, we know the proportion of the probability density that lies within an interval of a particular number of standard deviations centered at the mean. This allows a simple way of generating confidence intervals (though using the probability density directly is not significantly more difficult). Commonly used intervals are:\(\mu \pm n \sigma\) | \(\approx\) density proportion | Used for |
---|---|---|

\(n=1\) | .6827 | .68 |

\(n=2\) | .9545 | .95 |

\(n=3\) | .9973 | .995 |

## Variance at the training points

We noted that the variance at training points drops to 0. This means that we are sure that if we see an \(X\) value that we had in our training data, we are*certain*that the corresponding \(Y\) value will match that of the training example exactly. Except for the rare case of modeling mathematical functions, this is unreasonable. We can avoid this by introducing a hyper-parameter, \(\lambda\), which is added to the diagonal of the kernel matrix when calculating the covariance matrix, \(\Sigma\). This leads to: \[\Sigma = \begin{bmatrix} \Sigma_{1,1}=k(X_1,X_1) + \lambda & \Sigma_{1,2}=k(X_1,X_2) & \ldots & \Sigma_{1,1}=k(X_1,X_n) \\ \Sigma_{2,1}=k(X_2,X_1) & \Sigma_{2,2}=k(X_2,X_2) + \lambda & & \vdots \\ \vdots & & \ddots & \\ \Sigma_{n,1}=k(X_n,X_1) & \ldots & & \Sigma_{n,n}=k(X_n,X_n) + \lambda \\ \end{bmatrix}\] Let’s view models with \(\lambda \geq 0\): We see that variance no longer drops to 0 at training points. Treating the expected value as a regression estimate, we also see that the model now has residuals (errors) on the training data. This is because the expected values at these points are dragged towards the process mean.

## Reversion to Mean

An interesting characteristic of Gaussian processes is that outside the training data they will revert to the process mean. The speed of this reversion is governed by the kernel used. This contrasts with many non-linear models which experience ‘wild’ behaviour outside the training data – shooting of to implausibly large values. An example is given with polynomial regression below (ignore the fact that the polynomial regression model is wildly overfitted, wild behaviour outside the training data can happen even on models that are not overfitted):## Training

When working with Gaussian processes, you will need to try a variety of kernels and values for the \(\lambda\) hyper-parameter. Additionally, the kernels themselves will typically have a number of parameters that need to be specified and which function as hyper-parameters for the Guassian process training algorithm. Typically you would decide on values for these hyper-parameters using a validation method. So as to make sure you select a good kernel function, you would be seeking to optimize the likelihood of the validation data rather than, for example, the mean squared error of the validation data to the expected value regression curve.## Online Learning

Online learning is when the model updates itself to incoming data. It is simple to perform online learning with Gaussian processes. As new labelled data cases become available, the process covariance matrix is updated.## Distribution over Functions

Consider that we can obtain conditional distributions for an arbitrary number of new \(X\) values at once. Conceptually, we can even consider the case where we work with*uncountably*many new \(X\) values, allowing us to obtain a conditional distribution for every point in the input space, given the training data. This conditional distribution is a distribution over a function on the input space. Below are two examples where we sample a function from a Gaussian process generated from the training data:

\(X\) | \(Y\) |
---|---|

4.5 | 2.1 |

1.6 | 4.3 |

3.5 | 1.7 |

1.4 | 4.4 |

1.1 | 4.3 |

0.7 | 5.1 |

## Scalability

Gaussian processes scale to large data even worse than most other kernel methods. We often have to calculate the kernel matrix in kernel methods, with a complexity of \(N^2\) for a data set with \(N\) rows. But for Gaussian processes, we have to also calculate the inverse of this matrix, leading to a complexity of \(N^3\). There are a number of ways of reducing the complexity by using approximations to either the covariance matrix or the inverse, such as approximating the inverse using numerical methods, or approximating the covariance matrix using a banded matrix and then approximating the inverse of this banded matrix by another banded matrix. We do not know how popular these approximation methods are in real life, but do think it fair to say that Gaussian processes are typically used with reasonably small data sets.## Our purpose is to transform access to education.

We offer a diverse selection of courses from leading universities and cultural institutions from around the world. These are delivered one step at a time, and are accessible on mobile, tablet and desktop, so you can fit learning around your life.

We believe learning should be an enjoyable, social experience, so our courses offer the opportunity to discuss what you’re learning with others as you go, helping you make fresh discoveries and form new ideas.

You can unlock new opportunities with unlimited access to hundreds of online short courses for a year by subscribing to our Unlimited package. Build your knowledge with top universities and organisations.

Learn more about how FutureLearn is transforming access to education