Estimating Derivatives for Gaussian Processes

In a previous post I wrote about basic Gaussian processes. Sometimes it is of interest to also estimate the gradient of the fitted function, ideally with a confidence interval of its own. For a given covariance function $$k(x_i,x_j)$$ and an estimated function $$f(x)$$, the following relations hold:

$\begin{array}{rcl} \text{cov} \left( f_i, \frac{\partial f_j}{\partial x_j} \right) & = & \frac{\partial k(x_i, x_j)}{\partial x_j} \\ \text{cov} \left( \frac{\partial f_i}{\partial x_i}, f_j \right) & = & \frac{\partial k(x_i, x_j)}{\partial x_i} \\ \text{cov} \left( \frac{\partial f_i}{\partial x_i}, \frac{\partial f_j}{\partial x_j} \right) & = & \frac{\partial^2 k(x_i, x_j)}{\partial x_i \partial x_j} \end{array}$
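As a sanity check on these relations, one can sample from the joint normal distribution of $$f(x_1)$$ and $$\frac{\partial f}{\partial x}(x_2)$$ and compare the empirical cross-covariance with the analytic $$\frac{\partial k(x_1,x_2)}{\partial x_2}$$. The sketch below (test points and length scale chosen arbitrarily) uses the squared exponential kernel that appears further down:

```python
import numpy as np

ell = 0.3
x1, x2 = 0.1, 0.3

def k(a, b):
    return np.exp(-0.5 * (a - b) ** 2 / ell ** 2)

# Joint covariance of (f(x1), f'(x2)) for the squared exponential kernel:
#   cov(f(x1), f'(x2)) = dk/dx2 = (x1 - x2) / ell**2 * k(x1, x2)
#   var(f'(x2))        = d2k/dxi dxj at xi = xj = x2, which is 1 / ell**2
c12 = (x1 - x2) / ell ** 2 * k(x1, x2)
C = np.array([[k(x1, x1), c12],
              [c12, 1.0 / ell ** 2]])

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(2), C, size=200_000)
emp = np.cov(samples.T)[0, 1]
print(emp, c12)  # the empirical value should be close to the analytic one
```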

The joint distribution of the measured values $$f$$ and the values $$f_*$$ we want to estimate is multivariate normal:

$\left( \begin{array}{c} f \\ f_* \end{array} \right) \sim N \left( \left( \begin{array}{c} 0 \\ 0 \end{array} \right), \left( \begin{array}{cc} K(X,X) + \sigma^2_n I & K(X,X_*) \\ K(X_*,X) & K(X_*,X_*) \end{array} \right) \right)$

Replacing $$f_*$$ with its derivative and applying the covariance relations above gives:

$\left( \begin{array}{c} f \\ \frac{\partial f_*}{\partial x_*} \end{array} \right) \sim N \left( \left( \begin{array}{c} 0 \\ \vdots \\ 0 \\ \hline 0 \\ \vdots \\ 0 \end{array} \right), \left( \begin{array}{cccc|cccc} k(x_1,x_1) + \sigma^2_n & k(x_1,x_2) & \cdots & k(x_1,x_n) & \frac{\partial k(x_1,x^*_1)}{\partial x^*_1} & \frac{\partial k(x_1,x^*_2)}{\partial x^*_2} & \cdots & \frac{\partial k(x_1,x^*_n)}{\partial x^*_n} \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ k(x_n,x_1) & k(x_n,x_2) & \cdots & k(x_n,x_n) + \sigma^2_n & \frac{\partial k(x_n,x^*_1)}{\partial x^*_1} & \frac{\partial k(x_n,x^*_2)}{\partial x^*_2} & \cdots & \frac{\partial k(x_n,x^*_n)}{\partial x^*_n} \\ \hline \frac{\partial k(x^*_1,x_1)}{\partial x^*_1} & \frac{\partial k(x^*_1,x_2)}{\partial x^*_1} & \cdots & \frac{\partial k(x^*_1,x_n)}{\partial x^*_1} & \frac{\partial^2 k(x^*_1,x^*_1)}{\partial {x^*_1}^2} & \frac{\partial^2 k(x^*_1,x^*_2)}{\partial x^*_1 \partial x^*_2} & \cdots & \frac{\partial^2 k(x^*_1,x^*_n)}{\partial x^*_1 \partial x^*_n} \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ \frac{\partial k(x^*_n,x_1)}{\partial x^*_n} & \frac{\partial k(x^*_n,x_2)}{\partial x^*_n} & \cdots & \frac{\partial k(x^*_n,x_n)}{\partial x^*_n} & \frac{\partial^2 k(x^*_n,x^*_1)}{\partial x^*_n \partial x^*_1} & \frac{\partial^2 k(x^*_n,x^*_2)}{\partial x^*_n \partial x^*_2} & \cdots & \frac{\partial^2 k(x^*_n,x^*_n)}{\partial {x^*_n}^2} \end{array} \right) \right)$

Conditioning on the observations in the usual way gives the predictive distribution for the derivative:

$\left. \frac{\partial f_*}{\partial X_*} \right| X_*, f, X \sim N \left( \underbrace{ \frac{\partial K(X_*,X)}{\partial X_*} \left( K(X,X) + \sigma^2_n I \right)^{-1} f}_{\text{Expected derivative}}, \underbrace{ \frac{\partial^2 K(X_*,X_*)}{\partial X^2_*} - \frac{\partial K(X_*,X)}{\partial X_*} \left( K(X,X) + \sigma^2_n I \right)^{-1} \frac{\partial K(X,X_*)}{\partial X_*} }_{\text{Covariance of derivative}} \right)$
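A useful way to check the expected-derivative term is to note that it is exactly the derivative of the ordinary GP posterior mean, since that mean is linear in $$k(x_*, x)$$. The sketch below (toy data of my own, squared exponential kernel) compares the analytic derivative against a central finite difference of the posterior mean:

```python
import numpy as np

ell, noise = 0.3, 0.01
X = np.array([0.1, 0.4, 0.7])
y = np.array([0.2, 0.7, 0.3])

def kvec(xs, X):
    # k(x_s, X) for the squared exponential kernel.
    return np.exp(-0.5 * (xs - X) ** 2 / ell ** 2)

K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / ell ** 2)
alpha = np.linalg.solve(K + noise * np.eye(len(X)), y)

def post_mean(xs):
    return kvec(xs, X) @ alpha

xs = 0.25
# Analytic derivative of the posterior mean: dk(xs, X)/dxs @ alpha.
dk = -(xs - X) / ell ** 2 * kvec(xs, X)
dmean = dk @ alpha

h = 1e-6
fd = (post_mean(xs + h) - post_mean(xs - h)) / (2 * h)
print(dmean, fd)  # should agree to several decimals
```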

To make this concrete we need explicit expressions for the derivatives of the covariance function. Using a squared exponential covariance function:

$k(x_i,x_j) = \exp \left( -\frac{1}{2} \frac{||x_i-x_j||^2_2}{\ell^2} \right)$

The derivatives with respect to $$x_i$$ and $$x_j$$ are:

$\frac{\partial k(x_i,x_j)}{\partial x_i} = - \frac{x_i - x_j}{\ell^2} \exp \left( -\frac{1}{2} \frac{||x_i-x_j||^2_2}{\ell^2} \right)$

$\frac{\partial k(x_i,x_j)}{\partial x_j} = \frac{x_i - x_j}{\ell^2} \exp \left( -\frac{1}{2} \frac{||x_i-x_j||^2_2}{\ell^2} \right)$

and the second-order derivatives are:

$\frac{\partial^2 k(x_i, x_j)}{\partial x^2_i} = \frac{1}{\ell^2} \left( \frac{(x_i-x_j)^2}{\ell^2} - 1 \right) \exp \left( -\frac{1}{2} \frac{||x_i-x_j||^2_2}{\ell^2} \right)$

$\frac{\partial^2 k(x_i, x_j)}{\partial x^2_j} = \frac{1}{\ell^2} \left( \frac{(x_i-x_j)^2}{\ell^2} - 1 \right) \exp \left( -\frac{1}{2} \frac{||x_i-x_j||^2_2}{\ell^2} \right)$

$\frac{\partial^2 k(x_i, x_j)}{\partial x_i \partial x_j} = \frac{1}{\ell^2} \left( 1 - \frac{(x_i-x_j)^2}{\ell^2} \right) \exp \left( -\frac{1}{2} \frac{||x_i-x_j||^2_2}{\ell^2} \right)$
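These closed-form expressions are easy to get wrong, so a quick comparison against central finite differences is worthwhile. This standalone sketch (arbitrary test points of my own) checks the first derivative and the mixed second derivative:

```python
import numpy as np

ell = 0.3
xi, xj, h = 0.2, 0.55, 1e-5

def k(a, b):
    return np.exp(-0.5 * (a - b) ** 2 / ell ** 2)

def dk_dxi(a, b):
    # Closed-form derivative with respect to the first argument.
    return -(a - b) / ell ** 2 * k(a, b)

# Analytic values of the first and mixed second derivative.
d1 = dk_dxi(xi, xj)
d2 = (1.0 / ell ** 2) * (1.0 - (xi - xj) ** 2 / ell ** 2) * k(xi, xj)

# Central finite differences as reference values.
fd1 = (k(xi + h, xj) - k(xi - h, xj)) / (2 * h)
fd2 = (dk_dxi(xi, xj + h) - dk_dxi(xi, xj - h)) / (2 * h)

print(abs(d1 - fd1), abs(d2 - fd2))  # both differences should be tiny
```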

Combining all of this gives the result shown in the following figure:

The blue line is the expected function value and the dashed blue lines form the 95% confidence interval. The thick red lines are the expected derivatives and the dashed red lines are 95% confidence intervals for the derivatives. A perhaps surprising result is that the derivative is very uncertain close to an observation while it is more certain in between. The uncertainty of the derivatives depends strongly on the length scale (a longer length scale reduces the uncertainty), and many points in a tight interval also help.

Python code for creating the plot is given below:

#!/usr/bin/env python3
import numpy as np
import matplotlib.pyplot as plt


def covfun(x1, x2, ell):
    # Squared exponential covariance function.
    return np.exp(-0.5 * ((x1 - x2) / ell) ** 2.0)


def dx1_covfun(x1, x2, ell):
    # Derivative of the covariance with respect to the first argument.
    return (x2 - x1) / ell ** 2.0 * covfun(x1, x2, ell)


def dx2_covfun(x1, x2, ell):
    # Derivative of the covariance with respect to the second argument.
    return (x1 - x2) / ell ** 2.0 * covfun(x1, x2, ell)


def ddx1x2_covfun(x1, x2, ell):
    # Mixed second derivative of the covariance.
    ell2 = ell ** 2.0
    return (ell2 - (x1 - x2) ** 2.0) / ell2 ** 2.0 * covfun(x1, x2, ell)


# Characteristic length scale.
l = np.sqrt(0.025)

# Noise variance.
s = 0.01

# Measured points.
x = np.array([0.1, 0.15, 0.18, 0.2, 0.4, 0.6, 0.8, 0.9])
y = np.array([0.2, 0.3, 0.4, 0.5, 0.7, 0.4, 0.3, 0.2])

# Grid to predict values for.
xg = np.linspace(0.0, 1.0, 61)

# Covariance blocks for function values (K) and derivatives (D).
K11 = np.zeros((len(x), len(x)))
K12 = np.zeros((len(x), len(xg)))
K21 = np.zeros((len(xg), len(x)))
K22 = np.zeros((len(xg), len(xg)))

D12 = np.zeros((len(x), len(xg)))
D21 = np.zeros((len(xg), len(x)))
D22 = np.zeros((len(xg), len(xg)))

# Fill the covariance matrix for the measured points.
for row in range(len(x)):
    for col in range(len(x)):
        K11[row, col] = covfun(x[row], x[col], l)

# Do the same for the parts where the grid is involved.
for row in range(len(x)):
    for col in range(len(xg)):
        K12[row, col] = covfun(x[row], xg[col], l)
        K21[col, row] = covfun(xg[col], x[row], l)
        D12[row, col] = dx2_covfun(x[row], xg[col], l)
        D21[col, row] = dx1_covfun(xg[col], x[row], l)

# And the grid itself.
for row in range(len(xg)):
    for col in range(len(xg)):
        K22[row, col] = covfun(xg[row], xg[col], l)
        D22[row, col] = ddx1x2_covfun(xg[row], xg[col], l)

# Posterior mean and standard deviation for the grid values.
Kinv = np.linalg.pinv(K11 + s * np.eye(len(x)))
yg = np.dot(np.dot(K21, Kinv), y)
varyg = np.diag(K22 - np.dot(np.dot(K21, Kinv), K12))
stdyg = np.sqrt(np.clip(varyg, 0.0, None))

# Posterior mean and standard deviation for the derivatives.
dyg = np.dot(np.dot(D21, Kinv), y)
vardyg = np.diag(D22 - np.dot(np.dot(D21, Kinv), D12))
stddyg = np.sqrt(np.clip(vardyg, 0.0, None))

plt.plot(xg, yg, 'b', linewidth=2)
plt.plot(xg, yg + 1.96 * stdyg, 'b--')
plt.plot(xg, yg - 1.96 * stdyg, 'b--')

# Draw short line segments with the expected slope and its confidence bounds.
L = 0.025
for k in range(5, len(xg), 5):
    plt.plot([xg[k] - L, xg[k] + L],
             [yg[k] - dyg[k] * L, yg[k] + dyg[k] * L], 'r', linewidth=3)

    plt.plot([xg[k] - L, xg[k] + L],
             [yg[k] - (dyg[k] + 1.96 * stddyg[k]) * L,
              yg[k] + (dyg[k] + 1.96 * stddyg[k]) * L], 'r--')

    plt.plot([xg[k] - L, xg[k] + L],
             [yg[k] - (dyg[k] - 1.96 * stddyg[k]) * L,
              yg[k] + (dyg[k] - 1.96 * stddyg[k]) * L], 'r--')

plt.plot(x, y, 'ok')

plt.axis([0.0, 1.0, 0.0, 1.0])
plt.xlabel('$x$')
plt.ylabel('$f(x)$')
plt.show()
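As an aside, the double loops in the script are clear but slow for larger grids; the same covariance blocks can be built with NumPy broadcasting. The helper below is my own sketch (the function name is not from the script above), restricted to the blocks the derivative prediction actually needs:

```python
import numpy as np

def sqexp_blocks(x, xg, ell):
    """Covariance blocks for values and derivatives of a squared exponential kernel."""
    ell2 = ell ** 2
    dxx = x[:, None] - x[None, :]    # pairwise differences, shape (n, n)
    dgx = xg[:, None] - x[None, :]   # shape (m, n)
    dgg = xg[:, None] - xg[None, :]  # shape (m, m)
    K11 = np.exp(-0.5 * dxx ** 2 / ell2)
    K21 = np.exp(-0.5 * dgx ** 2 / ell2)
    D21 = -dgx / ell2 * K21  # derivative of k(xg, x) with respect to xg
    D22 = (1.0 - dgg ** 2 / ell2) / ell2 * np.exp(-0.5 * dgg ** 2 / ell2)
    return K11, K21, D21, D22

x = np.array([0.1, 0.15, 0.18, 0.2, 0.4, 0.6, 0.8, 0.9])
xg = np.linspace(0.0, 1.0, 61)
K11, K21, D21, D22 = sqexp_blocks(x, xg, np.sqrt(0.025))
print(K11.shape, K21.shape, D21.shape, D22.shape)  # (8, 8) (61, 8) (61, 8) (61, 61)
```

With these blocks, `K12` and `D12` follow by transposing `K21` and negating the transpose of `D21`, so the whole prediction reduces to a handful of matrix products.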