Gaussian Processes :: A Different Way to Explain Data - 2012-07-14
Short overview of the basics of Gaussian Processes.
During my school years, we were taught to explain data points using least-squares regression. The basic formula for describing such a functional relationship is:
$$y = f(x) + \varepsilon$$
The simplest way to explain such a relationship is to assume an expression for $$f(x)$$ based on knowledge about the process. Assume that we have some measurements according to the plot below. Unless we have sound knowledge about how these measurements were acquired, we do not know whether the points describe a functional form exactly or whether a simple function with a lot of noise is the better explanation.
The values of these measurements are:
$$\left( \begin{array}{c} x \\ y \end{array} \right) = \left( \begin{array}{cccccc} 0.1 & 0.2 & 0.4 & 0.6 & 0.8 & 0.9 \\ 0.2 & 0.5 & 0.7 & 0.4 & 0.3 & 0.2 \end{array} \right)$$
which generally can be written as:
$$\left( \begin{array}{c} x \\ y \end{array} \right) = \left( \begin{array}{cccccc} x_1 & x_2 & x_3 & x_4 & x_5 & x_6 \\ y_1 & y_2 & y_3 & y_4 & y_5 & y_6 \end{array} \right)$$
Assuming that the data can be explained using a second order polynomial with noise we say that:
$$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2$$
There is one such equation per measurement point which gives us a system of equations:
$$\left\{ \begin{array}{rcl} y_1 & = & \beta_0 + \beta_1 x_1 + \beta_2 x^2_1 \\ y_2 & = & \beta_0 + \beta_1 x_2 + \beta_2 x^2_2 \\ y_3 & = & \beta_0 + \beta_1 x_3 + \beta_2 x^2_3 \\ y_4 & = & \beta_0 + \beta_1 x_4 + \beta_2 x^2_4 \\ y_5 & = & \beta_0 + \beta_1 x_5 + \beta_2 x^2_5 \\ y_6 & = & \beta_0 + \beta_1 x_6 + \beta_2 x^2_6 \end{array} \right.$$
We have six equations and three unknowns ($$\beta_i$$). Rewriting these equations in matrix form gives:
$$\underbrace{ \left( \begin{array}{ccc} 1 & x_1 & x^2_1 \\ 1 & x_2 & x^2_2 \\ 1 & x_3 & x^2_3 \\ 1 & x_4 & x^2_4 \\ 1 & x_5 & x^2_5 \\ 1 & x_6 & x^2_6 \end{array} \right)}_{A} \underbrace{ \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \beta_2 \end{array} \right)}_{\beta} = \underbrace{ \left( \begin{array}{c} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \end{array} \right)}_{y}$$
or simply:
$$A \beta = y$$
To fit the assumed function structure to the measurement points in a least-squares sense, we solve
$$\min_{\beta} ||A \beta - y||^2_2$$
where the residual is often assumed to be normally distributed, $$\varepsilon \sim N(0, \sigma^2)$$, provided the model is adequate.
The solution to the minimization problem above is $$\hat{\beta} = \left( A^T A \right)^{-1} A^T y.$$ The result is shown in the figure below. Is this a good model? If it is not, how could the function be tweaked to fit the points better? Any better-fitting candidate function would quickly become much more complicated.
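As a concrete illustration, the normal-equations solution can be computed directly with NumPy. This is a minimal sketch using the measurement values listed above; the variable names are chosen for illustration only.

import numpy as np

# Measurement points from the table above.
x = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.9])
y = np.array([0.2, 0.5, 0.7, 0.4, 0.3, 0.2])

# Design matrix A with columns 1, x, x^2.
A = np.column_stack((np.ones_like(x), x, x**2))

# Normal-equations solution beta = (A^T A)^{-1} A^T y.
beta = np.linalg.solve(np.dot(A.T, A), np.dot(A.T, y))
print(beta)  # [beta_0, beta_1, beta_2]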
A different way to approach this is to use a so-called Gaussian Process (also known as Kriging in some disciplines).
A Gaussian Process is a statistical method that finds a distribution over functions by specifying a covariance function instead of a fixed functional form. The covariance function tells us how similar the function value at one point is expected to be to the value at another point. One such commonly used covariance function is the squared exponential:
$$k(x_i, x_j) = \exp \left( - \frac{1}{2} \frac{||x_i - x_j||^2_2}{\ell^2} \right)$$
Here we use $$\ell$$ to define the characteristic length of the function, i.e. how fast the function may change. Another interpretation is that $$\ell$$ is a smoothness or curvature control.
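To get a feel for how $$\ell$$ acts, here is a small sketch (assuming NumPy; the two example points and the two length scales are picked purely for illustration):

import numpy as np

def sqexp(xi, xj, ell):
    # Squared-exponential covariance between two scalar inputs.
    return np.exp(-0.5 * (xi - xj)**2 / ell**2)

# Two points 0.3 apart: a short length scale makes them nearly
# uncorrelated, a long one makes them strongly correlated.
print(sqexp(0.1, 0.4, 0.05))  # ~1.5e-08
print(sqexp(0.1, 0.4, 1.0))   # ~0.956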
Now, if we have a covariance function we can generate many functions which fulfill it. First, calculate a covariance matrix for a set of points ($$X = \{x_1, x_2, \dots, x_n\}$$), which may define the grid we are evaluating the function at.
$$K\left( X \right) = \left( \begin{array}{cccc} k(x_1,x_1) & k(x_1,x_2) & \cdots & k(x_1,x_n) \\ k(x_2,x_1) & k(x_2,x_2) & \cdots & k(x_2,x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n,x_1) & k(x_n,x_2) & \cdots & k(x_n,x_n) \end{array} \right)$$
If we generate a set of random numbers $$\xi \sim N(0,1)$$ and calculate
$$y = \mu + L\xi$$
then $$y \sim N \left(\mu, K(X,X)\right)$$.
$$L$$ is defined as the Cholesky decomposition of the covariance matrix:
$$L L^T = K(X,X)$$
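Putting these pieces together, a minimal sketch of drawing sample functions from the prior could look like this (assuming NumPy; the grid size, the number of samples and the small jitter added before the Cholesky factorization are my own choices, the jitter being there only for numerical stability):

import numpy as np

# Grid of inputs and its covariance matrix under the squared exponential.
ell = 0.1
xg = np.linspace(0.0, 1.0, 100)
K = np.exp(-0.5 * (xg[:, None] - xg[None, :])**2 / ell**2)

# Cholesky factor L with L L^T = K (jitter keeps K numerically positive definite).
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(xg)))

# Draw three zero-mean functions: y = mu + L * xi, with xi ~ N(0, 1).
samples = np.dot(L, np.random.randn(len(xg), 3))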
Things become interesting when we can predict a set of points $$Y_*$$ given that we know the outcomes of some earlier measurements:
$$\left. Y_* \right| X_*, X, Y$$
where $$X_*$$ is the set of $$x$$-values where we want to make function predictions and $$X$$ and $$Y$$ are known measured values.
Doing such a calculation is well known for multivariate joint normally distributed variables. A multivariate normal probability function is defined as
$$p(x|\mu,\Sigma) = (2\pi)^{-D/2}|\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right)$$
where $$D$$ is the dimensionality, $$\mu$$ is the average value and $$\Sigma$$ is the covariance matrix. Two vectors $$x$$ and $$y$$ can have a joint normal distribution which is written as:
$$\left( \begin{array}{c} x \\ y \end{array} \right) \sim N \left( \begin{array}{ccc} \left( \begin{array}{c} \mu_x \\ \mu_y \end{array} \right) & , & \left( \begin{array}{cc} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{array} \right) \end{array} \right)$$
If we know that $$y = \xi$$ we can calculate the distribution of $$x$$ conditioned on that observation, i.e. $$x|y = \xi \sim N \left( \bar{\mu}, \bar{\Sigma}\right)$$, where
$$\left\{ \begin{array}{rcl} \bar{\mu} & = & \mu_x + \Sigma_{12} \Sigma^{-1}_{22} (\xi - \mu_y) \\ \bar{\Sigma} & = & \Sigma_{11} - \Sigma_{12} \Sigma^{-1}_{22} \Sigma_{21} \end{array} \right.$$
This expression means that, given a set of measurements and a covariance function, we can calculate the expected outcome of $$f(x)$$ at any new point $$x$$, and we can also easily calculate the variance (and hence the 95% confidence interval, $$\pm 1.96 \cdot \sigma$$, around the predicted value of $$f(x)$$ for each $$x$$). Note that the expected value is a function of the measured values, whereas the covariance does not depend on the measured values at all. This gives us the possibility to choose where to take the next sample to gain the most information given all previous samples and a covariance function.
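As a small numerical check of the conditioning formulas, here is a sketch with a toy joint Gaussian; the block values are made up for illustration and NumPy is assumed.

import numpy as np

# Toy joint Gaussian split into the blocks used above (values assumed).
mu_x = np.array([0.0])
mu_y = np.array([0.0])
S11 = np.array([[1.0]])
S12 = np.array([[0.8]])
S21 = np.array([[0.8]])
S22 = np.array([[1.0]])

# Observe y = xi and condition: x | y = xi ~ N(mu_bar, S_bar).
xi = np.array([0.5])
mu_bar = mu_x + np.dot(S12, np.linalg.solve(S22, xi - mu_y))
S_bar = S11 - np.dot(S12, np.linalg.solve(S22, S21))
# For these block values: mu_bar = [0.4] and S_bar = [[0.36]].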
The joint distribution for measured values $$Y$$ and predicted values $$Y_*$$ with a noise term added is:
$$\left( \begin{array}{c} Y \\ Y_* \end{array} \right) \sim N\left(\left(\begin{array}{c} 0 \\ 0 \end{array}\right), \left(\begin{array}{cc}K(X,X) + \sigma_n^2 I & K(X,X_*) \\ K(X_*,X) & K(X_*,X_*) \end{array} \right) \right)$$
We can now write the distribution for the predicted set of function outputs $$Y_*$$ as:
$$\left. Y_* \right| X_*, X, Y \sim N\left( \underbrace{K(X_*,X) \left(K(X,X) + \sigma^2_n I \right)^{-1} Y}_{\text{expected function}}, \underbrace{K(X_*,X_*) - K(X_*,X) \left(K(X,X) + \sigma^2_n I \right)^{-1} K(X,X_*)}_{\text{function uncertainty}} \right)$$
Calculating the matrices $$K$$ and evaluating the expected value and variance for a grid gives the following result:
The black line is the previous second-order polynomial. The red line is the expected function output and the dashed lines are the upper and lower 95% confidence intervals. By not forcing a parameterized function, and instead specifying the relation between points using a covariance function, we can fit much more non-linear behaviour in complex data.
The Python code used to generate the data for this plot is:
#!/usr/bin/env python2.7
import numpy as np

def covfun(x1, x2, ell):
    # Squared exponential covariance function (note the ell**2 in the
    # denominator, matching the formula above).
    k = np.exp(-0.5*(x1 - x2)**2.0 / ell**2.0)
    return k

# Characteristic length.
l = 0.1
# Noise variance.
s = 0.1
# Measured points.
x = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.9])
y = np.array([0.2, 0.5, 0.7, 0.4, 0.3, 0.2])
# Grid to predict values for.
xg = np.linspace(0.0, 1.0, 21)
# Initialization of covariance matrices.
K11 = np.zeros((len(x), len(x)))
K12 = np.zeros((len(x), len(xg)))
K21 = np.zeros((len(xg), len(x)))
K22 = np.zeros((len(xg), len(xg)))
# Fill the covariance matrix for the measured points.
for row in xrange(len(x)):
    for col in xrange(len(x)):
        K11[row, col] = covfun(x[row], x[col], l)
# Do the same for the parts where the grid is involved.
for row in xrange(len(x)):
    for col in xrange(len(xg)):
        K12[row, col] = covfun(x[row], xg[col], l)
        K21[col, row] = covfun(xg[col], x[row], l)
# And the grid itself.
for row in xrange(len(xg)):
    for col in xrange(len(xg)):
        K22[row, col] = covfun(xg[row], xg[col], l)
# Expected values for the grid.
Kinv = np.linalg.pinv(K11 + s * np.eye(len(x)))
yg = np.dot(np.dot(K21, Kinv), y)
# Predictive variance for each grid point (diagonal of the covariance).
covyg = np.diag(K22 - np.dot(np.dot(K21, Kinv), K12))
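To draw the dashed 95% confidence band, the predictive standard deviation can be used directly. A minimal follow-up to the listing above, assuming yg and covyg hold the predictive mean and variance per grid point, could be:

# Upper and lower 95% confidence bounds around the predicted mean.
yg_upper = yg + 1.96 * np.sqrt(covyg)
yg_lower = yg - 1.96 * np.sqrt(covyg)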