Gaussian Distribution
Gaussian processes ¶
Gaussian processes are distributions over functions $f(x)$:
$$f(x) \sim \mathcal{GP}(m(x), k(x, x'))$$

For any finite subset $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ of the domain of $x$, the marginal distribution is

$$f(X) \sim \mathcal{N}(m(X), k(X, X))$$

with mean vector $\boldsymbol{\mu} = m(X)$ and covariance matrix $\Sigma = k(X, X)$.
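To make this concrete, here is a minimal sketch (numpy only; it assumes a zero-mean prior and an RBF covariance with unit length scale, mirroring the rbf_kernel defined later) that draws a few function samples from the GP prior at 50 input points:

import numpy as np
# Evaluate the prior at a finite set of inputs
X = np.linspace(-6, 6, 50).reshape(-1, 1)
Sigma = np.exp(-0.5 * (X - X.T) ** 2)          # k(X, X): RBF, unit length scale
Sigma += 1e-8 * np.eye(50)                     # jitter for numerical stability
mu = np.zeros(50)                              # zero-mean prior, m(X) = 0
samples = np.random.multivariate_normal(mu, Sigma, size=3)  # three draws of f(X)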
Gaussian processes for regression ¶
Given $n_1$ noisy training observations $\mathbf{y}_1 = f(X_1) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2 I)$, and $n_2$ test function values $\mathbf{y}_2 = f(X_2)$, the joint distribution is

$$ \left[\begin{array}{c} \mathbf{y}_{1} \\ \mathbf{y}_{2} \end{array}\right] \sim \mathcal{N} \left( \left[\begin{array}{c} \mu_{1} \\ \mu_{2} \end{array}\right], \left[ \begin{array}{cc} \Sigma_{11} + \sigma_\epsilon^2 I & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{array} \right] \right) $$

where

$$\begin{split} \mu_{1} & = m(X_1) \quad (n_1 \times 1) \\ \mu_{2} & = m(X_2) \quad (n_2 \times 1) \\ \Sigma_{11} & = k(X_1, X_1) \quad (n_1 \times n_1) \\ \Sigma_{22} & = k(X_2, X_2) \quad (n_2 \times n_2) \\ \Sigma_{12} & = k(X_1, X_2) = \Sigma_{21}^\top \quad (n_1 \times n_2) \end{split}$$
Note that $\Sigma_{11}$ depends only on the training inputs $X_1$ and $\Sigma_{22}$ only on the test inputs $X_2$; the noise term $\sigma_\epsilon^2 I$ appears only in the $\Sigma_{11}$ block because only the training targets $\mathbf{y}_1$ are observed with noise.
Conditioning the joint on the observations $\mathbf{y}_1$ gives the posterior predictive distribution:

$$\begin{split} p(\mathbf{y}_2 \mid \mathbf{y}_1, X_1, X_2) & = \mathcal{N}(\mu_{2|1}, \Sigma_{2|1}) \\ \mu_{2|1} & = \mu_2 + \Sigma_{21} \Sigma_{11}^{-1} (\mathbf{y}_1 - \mu_1) \\ & = \Sigma_{21} \Sigma_{11}^{-1} \mathbf{y}_1 \quad (\text{assuming a zero-mean prior, } \mu = 0) \\ & = (\Sigma_{11}^{-1} \Sigma_{12})^{\top} \mathbf{y}_1 \\ \Sigma_{2|1} & = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \\ & = \Sigma_{22} - (\Sigma_{11}^{-1} \Sigma_{12})^{\top} \Sigma_{12} \end{split}$$

It is then possible to predict $\mathbf{y}_2$ for the input samples $X_2$ by using the mean $\mu_{2|1}$ of the resulting distribution as the prediction. Notice that the posterior mean $\mu_{2|1}$ of a Gaussian process is a weighted average of the observed values $\mathbf{y}_1$, where the weights are determined by the covariance function $k$. The variance $\sigma_2^2$ of these predictions is the diagonal of the covariance matrix $\Sigma_{2|1}$.
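Before wrapping this in a reusable class, here is a minimal numpy sketch of these conditioning formulas on toy values (the inputs and targets below are made up for illustration; a zero-mean prior is assumed, and np.linalg.solve stands in for the explicit inverse):

import numpy as np

def k(a, b):
    # RBF kernel on column vectors of 1-D inputs
    return np.exp(-0.5 * (a - b.T) ** 2)

X1 = np.array([[-4.0], [0.0], [3.0]])       # observed inputs (toy values)
y1 = np.array([[0.3], [1.0], [0.6]])        # observed targets (toy values)
X2 = np.linspace(-6, 6, 5).reshape(-1, 1)   # prediction inputs

S11 = k(X1, X1) + 1e-8 * np.eye(3)          # jitter in place of noise
S12 = k(X1, X2)
S21 = S12.T
mu_2_1 = S21 @ np.linalg.solve(S11, y1)                  # posterior mean
Sigma_2_1 = k(X2, X2) - S21 @ np.linalg.solve(S11, S12)  # posterior covariance
sigma_2 = np.sqrt(np.diag(Sigma_2_1))                    # predictive std dev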
Intuitively, the kernel regression estimate at any point $x$ can be thought of as a weighted sum of the training labels $y_i$, weighted by the similarity between the corresponding $x_i$ and $x$:
$$ f(x) = \sum_{i=1}^{n} k(x_i, x)\,[K^{-1}\mathbf{y}]_i $$
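This sum is just the matrix product $k(x, X)\,K^{-1}\mathbf{y}$ written element-wise. A quick sketch checking that the two forms agree (toy training values, RBF kernel as above):

import numpy as np

def k(a, b):
    return np.exp(-0.5 * (a - b.T) ** 2)

X = np.array([[-2.0], [0.0], [1.5]])    # training inputs (toy values)
y = np.array([1.0, 0.5, -0.3])          # training labels (toy values)
x = np.array([[0.7]])                   # query point

invK_y = np.linalg.solve(k(X, X) + 1e-8 * np.eye(3), y)          # K^{-1} y
f_sum = sum(k(X[i:i+1], x)[0, 0] * invK_y[i] for i in range(3))  # element-wise sum
f_mat = (k(x, X) @ invK_y)[0]                                    # matrix form
assert np.isclose(f_sum, f_mat)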
# Imports
import numpy as np
import scipy
import scipy.spatial
import matplotlib.pyplot as plt
class GpRegressor:
    def __init__(self, X, y, kernel_func, noise):
        '''
        @param X: Training inputs (m, n): each row is an input vector
        @param y: Training outputs (m,): y = f(X) + noise
        @param kernel_func: covariance function k(X, X')
        @param noise: standard deviation of the observation noise
        '''
        self.X = X
        self.y = y
        self.kernel_func = kernel_func
        self.noise = noise
        # Covariance matrix K: (m, m), with the noise variance on the diagonal
        self.m = X.shape[0]
        self.n = X.shape[1]
        self.K = self.kernel_func(self.X, self.X) + ((self.noise ** 2) * np.eye(self.m))
        assert np.linalg.matrix_rank(self.K) == self.m
        self.invK = np.linalg.inv(self.K)
        self.invK_y = self.invK.dot(self.y)
    def predict(self, X):
        '''
        @param X: Test inputs (p, n): each row is an input vector
        @return: posterior mean (p, 1) and posterior covariance (p, p)
        '''
        K_21 = self.kernel_func(X, self.X)
        # mu_{2|1} = K_21 K^{-1} y,  Sigma_{2|1} = K_22 - K_21 K^{-1} K_12
        return K_21.dot(self.invK_y), self.kernel_func(X, X) - K_21.dot(self.invK.T).dot(K_21.T)
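As an aside, forming invK explicitly is fine for small problems but numerically fragile; a common alternative (sketched here as a hypothetical standalone helper, not a method of the class above) factorizes K once with a Cholesky decomposition and solves against it:

from scipy.linalg import cho_factor, cho_solve

def predict_chol(K, K_21, K_22, y):
    # Same posterior as predict(), but via Cholesky solves instead of np.linalg.inv
    c = cho_factor(K)                          # K = L L^T
    mean = K_21 @ cho_solve(c, y)              # K_21 K^{-1} y
    cov = K_22 - K_21 @ cho_solve(c, K_21.T)   # K_22 - K_21 K^{-1} K_12
    return mean, cov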
def rbf_kernel(xa, xb, sigma=1):
    '''Exponentiated quadratic (RBF) kernel; sigma scales the squared distance.'''
    sq_norm = -0.5 * scipy.spatial.distance.cdist(xa, xb, 'sqeuclidean') / sigma
    return np.exp(sq_norm)
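Note that sigma divides the squared distance here, so it acts as a squared length scale. A quick shape check on made-up inputs:

xa_demo = np.arange(6, dtype=float).reshape(3, 2)   # 3 points in 2-D (toy values)
xb_demo = np.arange(4, dtype=float).reshape(2, 2)   # 2 points in 2-D (toy values)
K_demo = rbf_kernel(xa_demo, xb_demo)               # (3, 2) pairwise similarities
assert K_demo.shape == (3, 2) and K_demo.max() <= 1.0  # RBF similarity peaks at 1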
# Define the true function that we want to regress on
def ground_height_func(x):
    # Truncated parabola: positive on |x| < 5, clipped to zero elsewhere
    ground_height = 1 - x ** 2 / 25
    ground_height[ground_height < 0] = 0
    return ground_height
n1 = 6 # Number of training points
n2 = 50 # Number of test points
domain = (-6, 6)
# Sample observations (X1, y1) on the function
X1 = np.linspace(domain[0], domain[1], n1).reshape(-1,1)
noise = 0  # The standard deviation of the observation noise
y1 = ground_height_func(X1) + noise * np.random.randn(n1).reshape(-1, 1)
gp_regressor = GpRegressor(X1, y1, rbf_kernel, noise)
X2 = np.linspace(domain[0], domain[1], n2).reshape(-1,1)
y2, K2 = gp_regressor.predict(X2)
sigma_pred = np.sqrt(np.diag(K2))
# Plot the posterior distribution and the training samples
fig, ax1 = plt.subplots(
nrows=1, ncols=1, figsize=(6,6), dpi=100)
# Plot the distribution of the function (mean, covariance)
ax1.plot(X2, ground_height_func(X2), 'b--', label='Ground height (GT)')
ax1.fill_between(X2.flat, (y2 - sigma_pred.reshape(-1, 1)).flat, (y2 + sigma_pred.reshape(-1, 1)).flat,
                 color='red', alpha=0.15, label=r'$\sigma_{pred|train}$')
ax1.plot(X2, y2, 'r.-', lw=2, label='Ground height (Pred)')
ax1.plot(X1, y1, 'ko', linewidth=2, label='Ground height (Training samples)')
ax1.set_xlabel('$x$', fontsize=13)
ax1.set_ylabel('$y$', fontsize=13)
ax1.set_title('Posterior distribution and training data')
ax1.axis([domain[0], domain[1], -3, 3])
ax1.legend()