Covariance Visualization
We have a matrix $A$ of dimension $m \times n$, where each row is a data sample with a feature vector $f \in \mathbb{R}^n$. From it we can build a covariance matrix that captures the correlations among the features (the usual $1/m$ or $1/(m-1)$ normalization is omitted here; the code below divides by the number of samples): $$ M_{n \times n} = \begin{bmatrix} (f_1 - \bar{f})^T & (f_2 - \bar{f})^T & \cdots \end{bmatrix}_{n \times m} \begin{bmatrix} f_1 - \bar{f} \\ f_2 - \bar{f} \\ \vdots \end{bmatrix}_{m \times n} $$
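As a quick check of this formula, here is a minimal sketch (the random data and variable names are purely illustrative) that centers the rows of $A$ and compares the result with `np.cov`:
# Covariance from a data matrix (rows = samples, columns = features)
import numpy as np
m, n = 1000, 3
A = np.random.randn(m, n)
A_centered = A - A.mean(axis=0)    # subtract the mean feature vector f_bar
M = A_centered.T @ A_centered / m  # n x n covariance, normalized by m
print(np.allclose(M, np.cov(A, rowvar=False, bias=True)))  # True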
Assume the data samples follow a zero-mean Gaussian distribution: $$ \begin{align} x &\sim \mathcal{N}(0, M)\\ p(x) &= \frac{e^{-\frac{1}{2}x^TM^{-1}x}}{\sqrt{(2\pi)^n|M|}} \end{align} $$ With the eigendecomposition $M = Q\Sigma Q^T$ (where $Q$ is orthogonal, so $|M| = |\Sigma|$), we get $$ \begin{align} p(x) &= \frac{e^{-\frac{1}{2}x^T(Q \Sigma Q^T)^{-1}x}}{\sqrt{(2\pi)^n|\Sigma|}} \\ &= \frac{e^{-\frac{1}{2}x^T Q \Sigma^{-1} Q^Tx}}{\sqrt{(2\pi)^n|\Sigma|}} \end{align} $$ Let $y = Q^Tx$; then: $$ \begin{align} p(x) &= \frac{e^{-\frac{1}{2}x^T Q \Sigma^{-1} Q^Tx}}{\sqrt{(2\pi)^n|\Sigma|}} \\ &= \frac{e^{-\frac{1}{2}y^T \Sigma^{-1} y}}{\sqrt{(2\pi)^n|\Sigma|}} \end{align} $$
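A small numerical sanity check of this change of variables, with an arbitrary $2 \times 2$ covariance chosen for illustration:
# Check that x^T M^{-1} x equals y^T Sigma^{-1} y with y = Q^T x
import numpy as np
M = np.array([[5.0, 4.0], [4.0, 5.0]])  # an arbitrary symmetric positive-definite covariance
eigvals, Q = np.linalg.eigh(M)          # M = Q diag(eigvals) Q^T
x = np.array([1.0, -2.0])
y = Q.T @ x
lhs = x @ np.linalg.inv(M) @ x
rhs = y @ np.diag(1.0 / eigvals) @ y
print(np.isclose(lhs, rhs))             # True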
$y^T\Sigma^{-1}y$ follows a $\chi^2$ distribution with $n$ degrees of freedom. For example, suppose $y \in \mathbb{R}^2$ and $\Sigma$'s diagonal elements (the variances) are $\sigma_1^2$ and $\sigma_2^2$. If we choose a p-value of $0.05$ with $2$ degrees of freedom, the $\chi^2$ table gives $$ y^T\Sigma^{-1}y = 5.99 \\ \frac{y_1^2}{\sigma_1^2} + \frac{y_2^2}{\sigma_2^2} = 5.99 $$
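Instead of a table lookup, the threshold can be computed directly; this snippet assumes SciPy is installed:
# 95% quantile of the chi-square distribution with 2 degrees of freedom
from scipy.stats import chi2
print(chi2.ppf(0.95, df=2))  # ~5.991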
This equation is just an ellipse! But wait, it is an equation in $y$, while our original data are $x$; what should we do? Remember $y = Q^Tx$? Then $x = Qy$, so the axes of $y$'s frame, expressed in $x$'s frame, are the columns $q_1$ and $q_2$ of $Q$.
# Covariance visualization
import numpy as np
import math
import matplotlib.pyplot as plt
npoints = 1000
mean = [0, 0]
s1 = 9
s2 = 1
cov = [[s1, 0], [0, s2]]
# Draw the 95% ellipse: x^2 / s1 + y^2 / s2 = 5.99, i.e.
# x^2 / (s1 * 5.99) + y^2 / (s2 * 5.99) = 1
# x = sqrt(s1 * 5.99) * cos(t)
# y = sqrt(s2 * 5.99) * sin(t)
x_o, y_o = np.random.multivariate_normal(mean, cov, npoints).T
#plt.plot(x_o, y_o, 'b.')
#plt.plot([math.cos(t) * math.sqrt(s1*5.99) for t in np.linspace(0, 2*math.pi, 100)],
# [math.sin(t) * math.sqrt(s2*5.99) for t in np.linspace(0, 2*math.pi, 100)],
# 'g-')
#plt.axis('equal')
c = math.cos(math.pi / 4)
s = math.sin(math.pi / 4)
# [c -s]
# [s c]
print('Rotation matrix that transforms points from frame O to frame P')
T_po = np.array([[c, -s],[s,c]])
print(T_po)
x_p = [e[0]*c-e[1]*s for e in zip(x_o, y_o)]
y_p = [e[0]*s+e[1]*c for e in zip(x_o, y_o)]
plt.plot(x_p, y_p, 'b.', label='Data points')
plt.axis('equal')
M = np.zeros((2,2))
# Accumulate the covariance of the rotated points (normalized by npoints below)
for i, j in zip(x_p, y_p):
    M[0, 0] += i*i
    M[0, 1] += i*j
    M[1, 0] += j*i
    M[1, 1] += j*j
S, Q = np.linalg.eig(M / npoints)  # S: eigenvalues (variances), Q: eigenvectors as columns
print('Rotation matrix that transforms points from the eigen frame to P')
print(Q)  # approximately matches T_po, up to column order and sign
# 95% ellipse parameterized in the eigen (principal-axis) frame
ellipse_x_p = [math.cos(t) * math.sqrt(S[0]*5.99) for t in np.linspace(0, 2*math.pi, 100)]
ellipse_y_p = [math.sin(t) * math.sqrt(S[1]*5.99) for t in np.linspace(0, 2*math.pi, 100)]
# Rotate the ellipse into the data frame via x = Q y
ellipse_x_o = [e[0]*Q[0,0]+e[1]*Q[0,1] for e in zip(ellipse_x_p, ellipse_y_p)]
ellipse_y_o = [e[0]*Q[1,0]+e[1]*Q[1,1] for e in zip(ellipse_x_p, ellipse_y_p)]
plt.plot(ellipse_x_o, ellipse_y_o, 'g-', label='Covariance ellipse')
# Project each point onto the first eigenvector: A @ q1 gives the first eigen coordinate
reduced_f_p = [e[0]*Q[0,0] + e[1]*Q[1,0] for e in zip(x_p, y_p)]
# Reconstruct in the data frame, with all other feature dimensions set to 0
fx_o = [e*Q[0,0]+0*Q[0,1] for e in reduced_f_p]
fy_o = [e*Q[1,0]+0*Q[1,1] for e in reduced_f_p]
plt.plot(fx_o, fy_o, 'rx', label='Reduced dimension features')
plt.legend()
plt.show()
$\Sigma$ contains the variance of each component of $y$. If we want to reduce the feature dimension, we can keep the first $k$ components with the largest variances (assuming the eigenvalues are sorted in descending order). We still need to transform the data from $x$ to $y$ using $y=Q^Tx$. In matrix form: $$ \tilde{A}^T = Q^TA^T \\ \tilde{A}^T \approx Q[:, 0:k]^TA^T \\ \tilde{A}_{(m\times k)} \approx AQ[:, 0:k] $$
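Continuing the script above (reusing `x_p`, `y_p`, `S`, and `Q`), a minimal sketch of this projection with $k = 1$:
# Reduce the 2-D points to k = 1 dimension using the top eigenvector
A = np.column_stack([x_p, y_p])  # m x 2 data matrix, rows are samples
order = np.argsort(S)[::-1]      # np.linalg.eig does not sort, so order by descending eigenvalue
Q_sorted = Q[:, order]
k = 1
A_reduced = A @ Q_sorted[:, :k]  # m x k reduced features
print(A_reduced.shape)           # (1000, 1)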
Another way to view the dimension reduction is through the density itself: $$ p(x) = \frac{e^{-\frac{1}{2}x^TQ\Sigma^{-1}Q^Tx}}{\sqrt{(2\pi)^n|\Sigma|}} = \frac{e^{-\frac{1}{2}x^T[q_1, q_2, \cdots, q_n]\Sigma^{-1}[q_1, q_2, \cdots, q_n]^Tx}}{\sqrt{(2\pi)^n|\Sigma|}} $$
We can set $q_{k+1}, \cdots, q_n$ and $\sigma_{k+1}, \cdots, \sigma_n$ to zero (similar to using a truncated Singular Value Decomposition to approximate a matrix) and obtain the same result.
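For reference, a small sketch of that SVD connection, reusing `A`, `S`, `order`, and `Q_sorted` from the previous snippet: the right singular vectors of the (centered) data matrix match the covariance eigenvectors, so truncating to rank $k$ yields the same projection.
# Compare the covariance eigenvectors/eigenvalues with the SVD of the centered data matrix
A_c = A - A.mean(axis=0)
U, sing, Vt = np.linalg.svd(A_c, full_matrices=False)
print(Vt[0], Q_sorted[:, 0])        # top right singular vector vs. top eigenvector (equal up to sign)
print(sing**2 / npoints, S[order])  # squared singular values / m vs. eigenvalues (close; M above was not mean-centered)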