Stefan's Site

Estimating Derivatives for Gaussian Processes

In a previous post I wrote about basic Gaussian processes. Sometimes it is also of interest to estimate the gradient of the underlying function, together with a confidence interval for it. Since differentiation is a linear operation, the derivative of a Gaussian process is again a Gaussian process, and for a given covariance function \(k(x_i,x_j)\) and an estimated function \(f(x)\) the following relations hold:

\[\begin{array}{rcl} \text{cov} \left( f_i, \frac{\partial f_j}{\partial x_j} \right) & = & \frac{\partial k(x_i, x_j)}{\partial x_j} \\ \text{cov} \left( \frac{\partial f_i}{\partial x_i}, f_j \right) & = & \frac{\partial k(x_i, x_j)}{\partial x_i} \\ \text{cov} \left( \frac{\partial f_i}{\partial x_i}, \frac{\partial f_j}{\partial x_j} \right) & = & \frac{\partial^2 k(x_i, x_j)}{\partial x_i \partial x_j} \end{array}\]
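
These relations follow from the fact that differentiation can be exchanged with the expectation (a short justification, assuming a differentiable kernel and, as below, a zero-mean process):

\[ \text{cov} \left( f_i, \frac{\partial f_j}{\partial x_j} \right) = E \left[ f_i \frac{\partial f_j}{\partial x_j} \right] = \frac{\partial}{\partial x_j} E \left[ f_i f_j \right] = \frac{\partial k(x_i, x_j)}{\partial x_j} \]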

The joint distribution of the measured values \(f\) and the values \(f_*\) that we want to estimate is multivariate normal:

\[\left( \begin{array}{c} f \\ f_* \end{array} \right) \sim N \left( \left( \begin{array}{c} 0 \\ 0 \end{array} \right), \left( \begin{array}{cc} K(X,X) + \sigma^2_n I & K(X,X_*) \\ K(X_*,X) & K(X_*,X_*) \end{array} \right) \right)\]
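
Conditioning this joint Gaussian on the observed \(f\) gives the familiar predictive distribution for the function values:

\[ \left. f_* \right| X_*, f, X \sim N \left( K(X_*,X) \left( K(X,X) + \sigma^2_n I \right)^{-1} f, \; K(X_*,X_*) - K(X_*,X) \left( K(X,X) + \sigma^2_n I \right)^{-1} K(X,X_*) \right) \]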

Replacing \(f_*\) by its derivative \(\partial f_*/\partial x_*\) and filling in the covariance relations from above, the joint distribution becomes:

\[\left( \begin{array}{c} f \\ \frac{\partial f_*}{\partial x_*} \end{array} \right) \sim N \left( \left( \begin{array}{c} 0 \\ \vdots \\ 0 \\ \hline 0 \\ \vdots \\ 0 \end{array} \right), \left( \begin{array}{cccc|cccc} k(x_1,x_1) + \sigma^2_n & k(x_1,x_2) & \cdots & k(x_1,x_n) & \frac{\partial k(x_1,x^*_1)}{\partial x^*_1} & \frac{\partial k(x_1,x^*_2)}{\partial x^*_2} & \cdots & \frac{\partial k(x_1,x^*_n)}{\partial x^*_n} \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ k(x_n,x_1) & k(x_n,x_2) & \cdots & k(x_n,x_n) + \sigma^2_n & \frac{\partial k(x_n,x^*_1)}{\partial x^*_1} & \frac{\partial k(x_n,x^*_2)}{\partial x^*_2} & \cdots & \frac{\partial k(x_n,x^*_n)}{\partial x^*_n} \\ \hline \frac{\partial k(x^*_1,x_1)}{\partial x^*_1} & \frac{\partial k(x^*_1,x_2)}{\partial x^*_1} & \cdots & \frac{\partial k(x^*_1,x_n)}{\partial x^*_1} & \frac{\partial^2 k(x^*_1,x^*_1)}{\partial {x^*_1}^2} & \frac{\partial^2 k(x^*_1,x^*_2)}{\partial x^*_1 \partial x^*_2} & \cdots & \frac{\partial^2 k(x^*_1,x^*_n)}{\partial x^*_1 \partial x^*_n} \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ \frac{\partial k(x^*_n,x_1)}{\partial x^*_n} & \frac{\partial k(x^*_n,x_2)}{\partial x^*_n} & \cdots & \frac{\partial k(x^*_n,x_n)}{\partial x^*_n} & \frac{\partial^2 k(x^*_n,x^*_1)}{\partial x^*_n \partial x^*_1} & \frac{\partial^2 k(x^*_n,x^*_2)}{\partial x^*_n \partial x^*_2} & \cdots & \frac{\partial^2 k(x^*_n,x^*_n)}{\partial {x^*_n}^2} \end{array} \right) \right) \]

Conditioning on the observations in exactly the same way gives the distribution of the derivatives:

\[\left. \frac{\partial f_*}{\partial X_*} \right| X_*, f, X \sim N \left( \underbrace{ \frac{\partial K(X_*,X)}{\partial X_*} \left( K(X,X) + \sigma^2_n I \right)^{-1} f}_{\text{Expected derivative}}, \underbrace{ \frac{\partial^2 K(X_*,X_*)}{\partial X^2_*} - \frac{\partial K(X_*,X)}{\partial X_*} \left( K(X,X) + \sigma^2_n I \right)^{-1} \frac{\partial K(X,X_*)}{\partial X_*} }_{\text{Variance of derivative}} \right)\]
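
The same computation can also be written in vectorized form. Here is a minimal sketch, assuming one-dimensional inputs and the squared exponential kernel derivatives derived below; the names se_kernel and gp_derivative_posterior are just illustrative, and the full plotting script at the end builds the same matrices with explicit loops:

import numpy as np

def se_kernel(a, b, ell):
    # Squared exponential covariance, evaluated elementwise on broadcast arrays.
    return np.exp(-0.5 * ((a - b) / ell) ** 2)

def gp_derivative_posterior(x, y, xs, ell, noise_var):
    # Posterior mean and variance of df/dx on the grid xs.
    K = se_kernel(x[:, None], x[None, :], ell) + noise_var * np.eye(len(x))
    # D[i, j] = d k(xs_i, x_j) / d xs_i
    D = -(xs[:, None] - x[None, :]) / ell ** 2 * se_kernel(xs[:, None], x[None, :], ell)
    # D2[i, j] = d^2 k(xs_i, xs_j) / (d xs_i d xs_j)
    diff = xs[:, None] - xs[None, :]
    D2 = (1.0 - (diff / ell) ** 2) / ell ** 2 * se_kernel(xs[:, None], xs[None, :], ell)
    Kinv = np.linalg.pinv(K)
    mean = np.dot(np.dot(D, Kinv), y)
    var = np.diag(D2 - np.dot(np.dot(D, Kinv), D.T))
    return mean, var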

To make this work we need explicit expressions for the derivatives of the covariance function. Using the squared exponential covariance function with length scale \(\ell\):

\[k(x_i,x_j) = \exp \left( -\frac{1}{2} \frac{||x_i-x_j||^2_2}{\ell^2} \right) \]

The derivatives with respect to \(x_i\) and \(x_j\) are:

\[\frac{\partial k(x_i,x_j)}{\partial x_i} = - \frac{x_i - x_j}{\ell^2} \exp \left( -\frac{1}{2} \frac{||x_i-x_j||^2_2}{\ell^2} \right) \]

\[ \frac{\partial k(x_i,x_j)}{\partial x_j} = \frac{x_i - x_j}{\ell^2} \exp \left( -\frac{1}{2} \frac{||x_i-x_j||^2_2}{\ell^2} \right) \]

and the second-order derivatives are

\[ \frac{\partial^2 k(x_i, x_j)}{\partial x^2_i} = \frac{1}{\ell^2} \left( \frac{(x_i-x_j)^2}{\ell^2} - 1 \right) \exp \left( -\frac{1}{2} \frac{||x_i-x_j||^2_2}{\ell^2} \right) \]

\[ \frac{\partial^2 k(x_i, x_j)}{\partial x^2_j} = \frac{1}{\ell^2} \left( \frac{(x_i-x_j)^2}{\ell^2} - 1 \right) \exp \left( -\frac{1}{2} \frac{||x_i-x_j||^2_2}{\ell^2} \right) \]

\[ \frac{\partial^2 k(x_i, x_j)}{\partial x_i \partial x_j} = \frac{1}{\ell^2} \left( 1 - \frac{(x_i-x_j)^2}{\ell^2} \right) \exp \left( -\frac{1}{2} \frac{||x_i-x_j||^2_2}{\ell^2} \right) \]
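
As a quick sanity check, the analytic expressions can be compared against central finite differences. A small sketch; the test point and step size below are arbitrary:

import numpy as np

ell = 0.2
h = 1e-5
xi, xj = 0.3, 0.45

def k(a, b):
    return np.exp(-0.5 * (a - b) ** 2 / ell ** 2)

# First derivative with respect to x_j: analytic vs. central difference.
analytic = (xi - xj) / ell ** 2 * k(xi, xj)
numeric = (k(xi, xj + h) - k(xi, xj - h)) / (2.0 * h)
print(analytic, numeric)

# Mixed second derivative: analytic vs. central difference in both arguments.
analytic2 = (1.0 - (xi - xj) ** 2 / ell ** 2) / ell ** 2 * k(xi, xj)
numeric2 = (k(xi + h, xj + h) - k(xi + h, xj - h)
            - k(xi - h, xj + h) + k(xi - h, xj - h)) / (4.0 * h ** 2)
print(analytic2, numeric2)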

Combining all of this gives the result shown in the following figure:

[Figure: expected values and derivatives with 95%-confidence intervals]

The blue line is the expected function value and the dashed blue lines mark the 95%-confidence interval. The thick red lines are the expected derivatives and the dashed red lines are 95%-confidence intervals for the derivatives. A perhaps surprising result is that the derivative is very uncertain close to an observation, while it is more certain in between observations. The uncertainty of the derivatives depends strongly on the length scale (a longer length scale reduces the uncertainty), and having many points in a tight interval also helps.
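
The length-scale dependence can also be read off directly from the prior: evaluating the mixed second derivative above at \(x_i = x_j\) gives the prior variance of the derivative,

\[ \left. \frac{\partial^2 k(x_i, x_j)}{\partial x_i \partial x_j} \right|_{x_i = x_j} = \frac{1}{\ell^2} \]

so before any data are observed the standard deviation of the derivative is \(1/\ell\), which shrinks as the length scale grows.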

Python code for creating the plot is given below:

#!/usr/bin/env python3

import numpy as np
import matplotlib.pyplot as plt

def covfun(x1, x2, ell):
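    # Squared exponential covariance k(x1, x2).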
    k = np.exp(-0.5 * ((x1 - x2) / ell) ** 2.0)
    return k

def dx1_covfun(x1, x2, ell):
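    # Derivative of the covariance with respect to the first argument, dk/dx1.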
    ell2 = ell ** 2.0
    k = (x2 - x1) / ell2 * covfun(x1, x2, ell)
    return k

def dx2_covfun(x1, x2, ell):
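    # Derivative of the covariance with respect to the second argument, dk/dx2.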
    ell2 = ell ** 2.0
    k = (x1 - x2) / ell2 * covfun(x1, x2, ell)
    return k

def ddx1x2_covfun(x1, x2, ell):
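    # Mixed second derivative d^2k/(dx1 dx2).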
    ell2 = ell ** 2.0
    ell4 = ell ** 4.0
    k = ((ell2 - (x1 - x2) ** 2.0) / ell4 * covfun(x1, x2, ell))
    return k

# Characteristic length.

l = np.sqrt(0.025)

# Noise variance.

s = 0.01

# Measured points.

x = np.array([0.1, 0.15, 0.18, 0.2, 0.4, 0.6, 0.8, 0.9])
y = np.array([0.2, 0.3, 0.4, 0.5, 0.7, 0.4, 0.3, 0.2])

# Grid to predict values for.

xg = np.linspace(0.0, 1.0, 61)

# Initialization of covariance matrix.

K11 = np.zeros((len(x), len(x)))
K12 = np.zeros((len(x), len(xg)))
K21 = np.zeros((len(xg), len(x)))
K22 = np.zeros((len(xg), len(xg)))

# Initialization of the derivative covariance blocks.

D12 = np.zeros((len(x), len(xg)))
D21 = np.zeros((len(xg), len(x)))
D22 = np.zeros((len(xg), len(xg)))

# Fill covariance matrix with values.

for row in range(len(x)):
    for col in range(len(x)):
        K11[row, col] = covfun(x[row], x[col], l)

# Do the same for the parts where the grid is involved.

for row in range(len(x)):
    for col in range(len(xg)):
        K12[row, col] = covfun(x[row], xg[col], l)
        K21[col, row] = covfun(xg[col], x[row], l)
        D12[row, col] = dx2_covfun(x[row], xg[col], l)
        D21[col, row] = dx1_covfun(xg[col], x[row], l)

# And the grid itself.

for row in range(len(xg)):
    for col in range(len(xg)):
        K22[row, col] = covfun(xg[row], xg[col], l)
        D22[row, col] = ddx1x2_covfun(xg[row], xg[col], l)

# Expected values for grid.

Kinv = np.linalg.pinv(K11 + s * np.eye(len(x)))
yg = np.dot(np.dot(K21, Kinv), y)
# Posterior standard deviation of the function values.
stdyg = np.sqrt(np.diag(K22 - np.dot(np.dot(K21, Kinv), K12)))

# Expected derivatives and their posterior standard deviation.
dyg = np.dot(np.dot(D21, Kinv), y)
dstdyg = np.sqrt(np.diag(D22 - np.dot(np.dot(D21, Kinv), D12)))

# Successive plt.plot calls draw on the same axes, so plt.hold is not needed.
plt.plot(xg, yg, 'b', linewidth=2)
plt.plot(xg, yg + 1.96 * stdyg, 'b--')
plt.plot(xg, yg - 1.96 * stdyg, 'b--')
L = 0.025
for k in range(5, len(xg), 5):
    # Short red segment showing the expected slope at xg[k].
    plt.plot(np.array([xg[k] - L, xg[k] + L]),
             np.array([yg[k] - dyg[k] * L,
                       yg[k] + dyg[k] * L]), 'r', linewidth=3)

    # Dashed segments showing the 95%-confidence slopes for the derivative.
    plt.plot(np.array([xg[k] - L, xg[k] + L]),
             np.array([yg[k] - (dyg[k] + 1.96 * dstdyg[k]) * L,
                       yg[k] + (dyg[k] + 1.96 * dstdyg[k]) * L]), 'r--')

    plt.plot(np.array([xg[k] - L, xg[k] + L]),
             np.array([yg[k] - (dyg[k] - 1.96 * dstdyg[k]) * L,
                       yg[k] + (dyg[k] - 1.96 * dstdyg[k]) * L]), 'r--')
plt.plot(x, y, 'ok')

plt.axis([0.0, 1.0, 0.0, 1.0])
plt.xlabel('$x$')
plt.ylabel('$f(x)$')
plt.show()