Mixture Models for Least Square Optimization

Problem Formulation ¶
Factor Potentials as Gaussian ¶
Factor Potentials as Gaussian Mixtures ¶
Robust Kernel with Gaussian Mixture ¶
Expectation Maximization for GMM based Factor Graph ¶
Self-supervise Learned Robust Kernel ¶

Problem Formulation ¶

The goal is to infer the posterior distribution of the state variable $x$ conditioned on sensor observations z $$ x^* = \underset{x}{\operatorname{argmax}} p (x | z) $$ By applying Bayes' rule and assuming uninformative prior $p(x)$, maximizing posterior distribution is same as maximizing likelihood distribution: $$ x^* = \underset{x}{\operatorname{argmax}} p (x | z) = x^* = \underset{x}{\operatorname{argmax}} p (z | x) $$ If we assume all observations $z$ are independent given some related state variable $x$, we can get: $$ x^* = \underset{x}{\operatorname{argmax}} \prod p (z_i | x) $$

Factor Potentials as Gaussian ¶

If we assume $p(z_i | x) \sim N (f_i(x)-z_i, \Sigma_i)$, then we can maximize the log-likelihood as: $$ x^* = \underset{x}{\operatorname{argmax}} \sum_i{log(\eta_i e^{-\frac{1}{2}(f_i(x)-z_i)^T\Sigma_i^{-1}(f_i(x)-z_i)})} $$ It then turns into a least square problem: $$ x^* = \underset{x}{\operatorname{argmin}} \sum_i{(f_i(x)-z_i)^T\Sigma_i^{-1}(f_i(x)-z_i)} = \underset{x}{\operatorname{argmin}} \sum_i{\|f_i(x) - z_i\|_{\Sigma_i}} $$ For the sack of simplicity, we assume $f_i$ are linear functions (We can always linearize it as $f_i(x) \approx J_i(x-x_0) + f_x(x_0)$). Then this becomes a linear least square problem, and the cost function is quadratic and convex.

A Simple Linear Regression ¶

Consider a very simple linear regression problem. In order to visualize the loss vs parameter in a 2D plot, we parameterize the line as $y = ax$. If we assume $\Sigma_i = I$, then $$ a^* = \sum_{i}(ax_i - y_i)^2 $$ As shown in the figure below, the cost function is clearly a quadratic function.

In [48]:

# A Simple Linear Regression

import numpy as np
import matplotlib.pyplot as plt

a = 1
x = np.linspace(-3, 3, 100)
y = a * x 
y_hat = y + np.random.normal(0, 0.1, x.shape)

a_hat = np.linspace(-1, 3, 1000)
losses = [np.sum((t*x-y_hat)**2)/x.shape[0] for t in a_hat]

a_best = a_hat[np.argmin(losses)]
print(f'Best $a$ with least loss is {a_best:.2f}')

fig, (ax1, ax2) = plt.subplots(
    nrows=2, ncols=1, figsize=(6,6), dpi=100)

ax1.plot(x, y, '-b', label='ground truth line y = x')
ax1.plot(x, a_best*x, '-g', label=f'predicted line y = {a_best:.2f}x')
ax1.plot(x, y_hat, 'r.', label='noisy observations')
ax1.set_ylabel('y')
ax1.set_xlabel('x')
ax1.legend()

ax2.plot(a_hat, losses, '-g', label='loss function')
ax2.set_ylabel('squared sum loss')
ax2.set_xlabel('slope a')
ax2.legend()

#

Best $a$ with least loss is 1.00

Out[48]:

<matplotlib.legend.Legend at 0x13799cfa0>

A Simple Linear Regression With Outliers ¶

However, the approach described above is very sensitive to outliers. Let's add a couple of outliers to the measurements. As you can see, the estimation is off. This is because these outliers contribute more errors to the cost function. The optimizer tries to reduce these errors.

In [49]:

# A Simple Linear Regression with outliers

import matplotlib.pyplot as plt
import numpy as np
from numpy.random import default_rng

npoints = 100
a = 1
x = np.linspace(-3, 3, npoints)
y = a * x 
y_hat = y + np.random.normal(0, 0.1, x.shape)

n_outliers = 20
rng = np.random.default_rng()
outlier_indices = rng.choice(npoints, size=n_outliers, replace=False)
# In most practical applications, residuals are not Gaussian distributed.
y_hat[outlier_indices] = np.random.uniform(-3, 3, n_outliers)

a_hat = np.linspace(-1, 3, 1000)
losses = [np.sum((t*x-y_hat)**2)/x.shape[0] for t in a_hat]

a_best = a_hat[np.argmin(losses)]
print(f'Best $a$ with least loss is {a_best:.2f}')

fig, (ax1, ax2) = plt.subplots(
    nrows=2, ncols=1, figsize=(6,6), dpi=100)

ax1.plot(x, y, '-b', label='ground truth line y = x')
ax1.plot(x, a_best*x, '-g', label=f'predicted line y = {a_best:.2f}x')
ax1.plot(x, y_hat, 'r.', label='noisy observations')
ax1.plot(x[outlier_indices], y_hat[outlier_indices], 'k*', label='outliers')
ax1.set_ylabel('y')
ax1.set_xlabel('x')
ax1.legend()

ax2.plot(a_hat, losses, '-g', label='loss function')
ax2.set_ylabel('squared sum loss')
ax2.set_xlabel('slope a')
ax2.legend()

#

Best $a$ with least loss is 0.73

Out[49]:

<matplotlib.legend.Legend at 0x137a51e50>

Robust Kernels ¶

How do we make the optimization more robust to outliers?

Detect the outliers before running optimization so that we never use them in the optimization.
Give less weights for outliers in the optimization. Outliers usually have large errors (residual) compared to normal data assume that the initial guesses of the state variables are not very off.

For (2), we can replace the $L_2$ norm with robust kernels (M-estimators): $$ x^* = \underset{x}{\operatorname{argmin}} \sum_i{\rho(r_i(x))} $$ where $r_i(x) = |f_i(x) - z_i|{\Sigma_i}^{-\frac{1}{2}}$, $\rho()$ is a function designed to reduce the influence of large residuals.

Huber Loss ¶

One common robust kernel is Huber loss: $$ \rho(r) = \begin{cases} \frac{r^2}{2} &\text{$|r| < k$}\\ k(|r|-\frac{k}{2}) &\text{$|r| \geq k$} \end{cases} $$

$$ \frac{\partial L}{\partial x} = \sum_i{\frac{\partial \rho(r_i)}{\partial r_i}\frac{\partial r_i(x)}{\partial x_i}} $$

We define $\phi = \frac{\partial \rho(r_i)}{\partial r_i}$, then $$ \frac{\partial L}{\partial x} = \sum_i{\frac{\phi}{r_i}r_i\frac{\partial r_i(x)}{\partial x_i}} $$

We define $w = \frac{\phi}{r_i}$, then it turns out that we can formulate loss L as: $$ L = \sum_i{w(r_i(x_c))r^2_i(x)} $$ where, $x_c$ is the current most recent estimate of $x$, aka linearization point.

For huber loss, the weight function is: $$ w(r) = \begin{cases} 1 &\text{$|r| < k$}\\ \frac{k}{|r|} &\text{$|r| \geq k$} \end{cases} $$

By setting the weight function as $w = \frac{1}{r_i}\frac{\partial \rho(r_i)}{\partial r_i} |_{r_i = r_i(x_c)}$, we can solve the the optimization problem by using the existing techniques for weighted least-squares such as Gauss-Newton, Levenberg-Marquardt, etc.

In [50]:

# Robust Kernel Functions

import matplotlib.pyplot as plt
import numpy as np

def huber_loss(r, k):
    y = np.zeros(r.shape)
    y[np.abs(r) < k] = 0.5 * (r[np.abs(r) < k] ** 2)
    y[np.abs(r) >= k] = k * (np.abs(r[np.abs(r) >= k]) - 0.5*k)
    return y

def huber_weight(r, k):
    y = np.ones(r.shape)
    y[np.abs(r) >= k] = k / np.abs(r[np.abs(r) >= k])
    return y
    
r = np.linspace(-5, 5, 100)

fig, (ax1, ax2) = plt.subplots(
    nrows=2, ncols=1, figsize=(6,6), dpi=100)

ax1.plot(r, 0.5 * r**2, '-r', label='$L_2$ loss')
ax1.plot(r, huber_loss(r,1), '-g', label='huber loss k=1')
ax1.plot(r, huber_loss(r,2), '-b', label='huber loss k=2')
ax1.set_ylabel('loss')
ax1.set_xlabel('r')
ax1.legend()

ax2.plot(r, np.ones(r.shape), '-r', label='$L_2$ weight')
ax2.plot(r, huber_weight(r,1), '-g', label='huber weight k=1')
ax2.plot(r, huber_weight(r,2), '-b', label='huber weight k=2')
ax2.set_ylabel('weight')
ax2.set_xlabel('r')
ax2.legend()

#

Out[50]:

<matplotlib.legend.Legend at 0x1376028b0>

Now, we can apply Huber loss to the linear regression problem.

In [51]:

# A Robust Kernel based Linear Regression with outliers

import matplotlib.pyplot as plt
import numpy as np
from numpy.random import default_rng

def huber_loss(r, k):
    y = np.zeros(r.shape)
    y[np.abs(r) < k] = 0.5 * (r[np.abs(r) < k] ** 2)
    y[np.abs(r) >= k] = k * (np.abs(r[np.abs(r) >= k]) - 0.5*k)
    return y

def huber_weight(r, k):
    y = np.ones(r.shape)
    y[np.abs(r) >= k] = k / np.abs(r[np.abs(r) >= k])
    return y

npoints = 100
a = 1
x = np.linspace(-3, 3, npoints)
y = a * x 
y_hat = y + np.random.normal(0, 0.1, x.shape)

n_outliers = 20
rng = np.random.default_rng()
outlier_indices = rng.choice(npoints, size=n_outliers, replace=False)
# In most practical applications, residuals are not Gaussian distributed.
y_hat[outlier_indices] += np.random.uniform(-3, 3, n_outliers)

a_hat = np.linspace(-1, 3, 1000)
huber_k = 0.1
losses = [np.sum(huber_loss(t*x-y_hat,huber_k)) / x.shape[0] for t in a_hat]

a_best = a_hat[np.argmin(losses)]
print(f'Best $a$ with least loss is {a_best:.2f}')

fig, (ax1, ax2) = plt.subplots(
    nrows=2, ncols=1, figsize=(6,6), dpi=100)

ax1.plot(x, y, '-b', label='ground truth line y = x')
ax1.plot(x, a_best*x, '-g', label=f'predicted line y = {a_best:.2f}x')
ax1.plot(x, y_hat, 'r.', label='noisy observations')
ax1.plot(x[outlier_indices], y_hat[outlier_indices], 'k*', label='outliers')
ax1.set_ylabel('y')
ax1.set_xlabel('x')
ax1.legend()

ax2.plot(a_hat, losses, '-g', label='loss function')
ax2.set_ylabel('squared sum loss')
ax2.set_xlabel('slope a')
ax2.legend()

#

Best $a$ with least loss is 1.00

Out[51]:

<matplotlib.legend.Legend at 0x137bda580>

A Generalized Robust Kernel ¶

A single generalized robust kernel can be express as:

$$ \rho(r, \alpha, k) = \frac{|\alpha-2|}{\alpha}((\frac{(\frac{|r|}{k})^2}{|\alpha-2|}+1)^{\frac{\alpha}{2}}-1) $$

where $\alpha$ is a real-valued parameter that controls the shape of the kernel and $k > 0$ is the scale parameter that determines the size of quadratic loss region around $r = 0$.

With removing singularities:

$$ \rho(r, \alpha, k) = \begin{cases} \frac{1}{2}(\frac{|r|}{k})^2 &\text{$\alpha = 2$}\\ log(\frac{1}{2}(\frac{|r|}{k})^2+1) &\text{$\alpha = 0$}\\ 1 - e^{-\frac{1}{2}(\frac{|r|}{k})^2} &\text{$\alpha = -\infty$}\\ \frac{|\alpha-2|}{\alpha}((\frac{(\frac{|r|}{k})^2}{|\alpha-2|}+1)^{\frac{\alpha}{2}}-1) &\text{otherwise} \end{cases} $$

Then weight function $w$ is: $$ w(r, \alpha, k) = \begin{cases} \frac{1}{k^2} &\text{$\alpha = 2$}\\ \frac{2}{r^2+2k^2} &\text{$\alpha = 0$}\\ \frac{1}{k^2} e^{-\frac{1}{2}(\frac{|r|}{k})^2} &\text{$\alpha = -\infty$}\\ \frac{1}{k^2}(\frac{(\frac{|r|}{k})^2}{|\alpha-2|}+1)^{\frac{\alpha}{2}-1} &\text{otherwise} \end{cases} $$

In [56]:

# A Generalized Robust Kernel Functions

import matplotlib.pyplot as plt
import numpy as np


def huber_loss(r, k):
    y = np.zeros(r.shape)
    y[np.abs(r) < k] = 0.5 * (r[np.abs(r) < k] ** 2)
    y[np.abs(r) >= k] = k * (np.abs(r[np.abs(r) >= k]) - 0.5*k)
    return y

def huber_weight(r, k):
    y = np.ones(r.shape)
    y[np.abs(r) >= k] = k / np.abs(r[np.abs(r) >= k])
    return y

def robust_loss(r, a, k):
    assert(a != 0)
    if a == 2:
        return 0.5 * (r/k)**2
    return np.abs(a-2) / a * (((r/k)**2/np.abs(a-2)+1)**(a/2.0)-1)

def robust_weight(r, a, k):
    assert(a != 0)
    if a == 2:
        return np.ones(r.shape, dtype=np.float32) / (k**2)
    return ((r/k)**2/np.abs(a-2)+1)**(0.5*a-1) / (k**2)
    
r = np.linspace(-5, 5, 100)

fig, (ax1, ax2) = plt.subplots(
    nrows=2, ncols=1, figsize=(6,6), dpi=100)

ax1.plot(r, 0.5 * r**2, '-r', label='$L_2$ loss')
ax1.plot(r, robust_loss(r,2,1), '-.y', label='robust loss a=2, k=1')
ax1.plot(r, huber_loss(r,1), '-g', label='huber loss k=1')
ax1.plot(r, robust_loss(r,1,1), '-.g', label='robust loss a=1, k=1')
ax1.plot(r, huber_loss(r,2), '-b', label='huber loss k=2')
ax1.plot(r, robust_loss(r,1,2), '-.b', label='robust loss a=1, k=2')
ax1.plot(r, robust_loss(r,-10,2), '-.r', label='robust loss a=-20, k=2')

ax1.set_ylabel('loss')
ax1.set_xlabel('r')
ax1.legend()

ax2.plot(r, np.ones(r.shape), '-r', label='$L_2$ weight')
ax2.plot(r, robust_weight(r, 2, 1), '-.y', label='robust weight a=2, k=1')
ax2.plot(r, huber_weight(r,1), '-g', label='huber weight k=1')
ax2.plot(r, robust_weight(r, 1, 1), '-.g', label='robust weight a=1, k=1')
ax2.plot(r, huber_weight(r,2), '-b', label='huber weight k=2')
ax2.plot(r, robust_weight(r,1,2), '-.b', label='robust weight a=1, k=2')
ax2.plot(r, robust_weight(r, -20, 1), '-.r', label='robust weight a=-20, k=2')
ax2.set_ylabel('weight')
ax2.set_xlabel('r')
ax2.legend()

#

Out[56]:

<matplotlib.legend.Legend at 0x1445a3d00>

Construct Probability Distributions from Robust Kernels ¶

Recall that we are trying to maximize the log-likelihood: $$ x^* = \underset{x}{\operatorname{argmax}} \sum_i log(p(z_i|x)) $$ In this section, we are trying to minimize the robust cost: $$ x^* = \underset{x}{\operatorname{argmin}} \sum_i{\rho(r_i(x))} $$

By connecting these two together, we can get: $$ p(z_i | x) = p(r_i(x, z_i)) = \frac{1}{kZ(\alpha)}e^{-\rho(r_i, \alpha, k)} $$ where $Z(\alpha) = \int{e^{-\rho(r_i, \alpha, 1)}}$

When optimizing hyperparameter $\alpha$ and state variable $x$ together, there are a couple of things to note:

$Z(\alpha)$ is unbounded for $\alpha < 0$. So a solution will be cut off integral to be in $[-\tau, \tau]$. Then, the probability distribution is not well normalized. But for maximizing log-likelihood purpose, it should work fine.
$x^*, \alpha^* = \underset{x, \alpha}{\operatorname{argmax}} \sum_i log(p_{\alpha}(z_i|x)) = \underset{x, \alpha}{\operatorname{argmin}}\sum_i\rho(r_i,\alpha,k) + log(kZ(\alpha))$. Be aware that the final cost function has a regularization term.

Factor Potentials as Gaussian Mixtures ¶

In the previous section, we assume the likelihood $p(z|x)$ follows Gaussian distribution. As we pointed out earlier, in practice, residuals sometime are not Gaussian distributed. This is exactly why we come up with robust kernels. However, using these robust kernels still assumes that residuals follows a symmetric distribution. Now let's approach the problem in a different way which assumes that residuals follow a Gaussian mixture model:

$$ p(z_i | x) \sim \sum_kw^i_kN(\mu^i_k, \Sigma^i_k) $$

As shown in the figure below, the mixture model (red) is not symmetric anymore. It can represent arbitrary distribution given proper components. But this also brings a difficulty in optimization. Remember we are trying to maximize log-likelihood and we can convert it into a least square problem with the assumption that residuals follow a single Gaussian distribution. But without Gaussian mixture, we can we do? $$ x^* = \underset{x}{\operatorname{argmax}} \prod p (z_i | x) = \underset{x}{\operatorname{argmax}} \sum_i log(\sum_kw^i_kN(\mu^i_k, \Sigma^i_k)) $$

In [55]:

# Gaussian Mixture Model

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

x = np.linspace(-6, 6, 100)
w1 = 0.5
u1 = -2
s1 = 1

w2 = 0.5
u2 = 1
s2 = 2

p1 = norm.pdf(x, u1, s1)
p2 = norm.pdf(x, u2, s2)

fig, (ax1) = plt.subplots(
    nrows=1, ncols=1, figsize=(6,6), dpi=100)

ax1.plot(x, p1, 'g', label='Component p1')
ax1.plot(x, p2, 'b', label='Component p2')
ax1.plot(x, w1*p1+w2*p2, 'r', label='Mixture model')

ax1.set_ylabel('pdf')
ax1.set_xlabel('x')
ax1.legend()

#

Out[55]:

<matplotlib.legend.Legend at 0x144403760>

Max-Mixture Model ¶

In order to convert the maximization of log-likelihood with Gaussian mixture into a nice least square problem, we try to use the max component in the mixtures to approximate it:

$$ p(z_i | x) \sim \sum_kw^i_kN(\mu^i_k, \Sigma^i_k) \sim \frac{\underset{k}{\operatorname{max}}\sum_kw^i_kN(\mu^i_k, \Sigma^i_k)}{\gamma^i} $$

Note that a normalization factor $\gamma$ is required in order to ensure that a max mixture integrates to $1$. However, for maximizing log-likelihood, it is just added as a constant ($-log(\gamma)$). So it won't affect the final solution of $x^*$:

$$ x^* = \underset{x}{\operatorname{argmax}} \sum_i{log(\frac{w^i_{k^*}}{\gamma^i}\eta_i e^{-\frac{1}{2}(\mu^i_{k^*})^T{\Sigma^i_{k^*}}^{-1}(\mu^i_{k^*})})} = \underset{x}{\operatorname{argmax}} \sum_i{log(e^{-\frac{1}{2}(\mu^i_{k^*})^T{\Sigma^i_{k^*}}^{-1}(\mu^i_{k^*})})} $$

I general max-mixture is a very good approximation to sum-mixture. I think a good practice is that when you define mixture model of the residuals, check how close the max-mixture matches the sum-mixture.

In [86]:

# Gaussian Mixture Model

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

x = np.linspace(-6, 6, 100)
fig, (ax1, ax2, ax3) = plt.subplots(
    nrows=3, ncols=1, figsize=(6,6), dpi=100)

w1 = 0.5
u1 = -2
s1 = 1

w2 = 1-w1
u2 = 2
s2 = 1

p1 = w1*norm.pdf(x, u1, s1)
p2 = w2*norm.pdf(x, u2, s2)

ax1.plot(x, p1, 'g', label='Component p1')
ax1.plot(x, p2, 'b', label='Component p2')
ax1.plot(x, p1+p2, 'r', label='Mixture model')
ax1.plot(x, [pa if pa > pb else pb for (pa, pb) in zip(p1, p2)], 'y-.', label='unnormalized Max mixture model')
ax1.set_ylabel('pdf')
ax1.set_xlabel('x')
ax1.legend()

w1 = 0.4
u1 = -2
s1 = 3

w2 = 1-w1
u2 = 2
s2 = 3

p1 = w1*norm.pdf(x, u1, s1)
p2 = w2*norm.pdf(x, u2, s2)

ax2.plot(x, p1, 'g', label='Component p1')
ax2.plot(x, p2, 'b', label='Component p2')
ax2.plot(x, p1+p2, 'r', label='Mixture model')
ax2.plot(x, [pa if pa > pb else pb for (pa, pb) in zip(p1, p2)], 'y-.', label='unnormalized Max mixture model')
ax2.set_ylabel('pdf')
ax2.set_xlabel('x')
ax2.legend()

w1 = 0.5
u1 = 0
s1 = .1

w2 = 1-w1
u2 = 0
s2 = 1

p1 = w1*norm.pdf(x, u1, s1)
p2 = w2*norm.pdf(x, u2, s2)

ax3.plot(x, p1, 'g', label='Component p1')
ax3.plot(x, p2, 'b', label='Component p2')
ax3.plot(x, p1+p2, 'r', label='Mixture model')
ax3.plot(x, [pa if pa > pb else pb for (pa, pb) in zip(p1, p2)], 'y-.', label='unnormalized Max mixture model')
ax3.set_ylabel('pdf')
ax3.set_xlabel('x')
ax3.legend()

#

Out[86]:

<matplotlib.legend.Legend at 0x145e67a60>

In [160]:

# Use Max mixturre model to reject outliers

import matplotlib.pyplot as plt
import numpy as np
from numpy.random import default_rng
from scipy.stats import norm

npoints = 100
a = 1
x = np.linspace(-3, 3, npoints)
y = a * x 
y_hat = y + np.random.normal(0, 0.1, x.shape)

# Number of outliers. Try to play with it.
n_outliers = 30
rng = np.random.default_rng()
outlier_indices = rng.choice(npoints, size=n_outliers, replace=False)
# In most practical applications, residuals are not Gaussian distributed.
y_hat[outlier_indices] = np.random.uniform(-3, 3, n_outliers)

# Suppose residuals follow Gaussian mixture distribution and we have two mixtures.
# Both mixtures have the same mean but different sigmas.
w1 = 0.6
s1 = 0.1
w2 = 1-w1
s2 = 1

a_hat = np.linspace(-1, 3, 1000)
losses = []
for t in a_hat:
    residuals = t*x-y_hat
    # TODO: We should check log-likelihood to avoid numerical issue.
    p1s = w1 * norm.pdf(residuals, loc=0, scale=s1)
    p2s = w2 * norm.pdf(residuals, loc=0, scale=s2)
    loss = 0
    for p1, p2, residual in zip(p1s, p2s, residuals):
        if p1 > p2:
            loss -= np.log(p1)
        else: 
            loss -= np.log(p2)
    loss /= x.shape[0]
    losses.append(loss)

a_best = a_hat[np.argmin(losses)]
print(f'Best $a$ with least loss is {a_best:.2f}')

fig, (ax1, ax2) = plt.subplots(
    nrows=2, ncols=1, figsize=(6,6), dpi=100)

ax1.plot(x, y, '-b', label='ground truth line y = x')
ax1.plot(x, a_best*x, '-g', label=f'predicted line y = {a_best:.2f}x')
ax1.plot(x, y_hat, 'r.', label='noisy observations')
ax1.plot(x[outlier_indices], y_hat[outlier_indices], 'k*', label='outliers')
ax1.set_ylabel('y')
ax1.set_xlabel('x')
ax1.legend()

ax2.plot(a_hat, losses, '-g', label='loss function')
ax2.set_ylabel('squared sum loss')
ax2.set_xlabel('slope a')
ax2.legend()

#

Best $a$ with least loss is 1.00

Out[160]:

<matplotlib.legend.Legend at 0x1495a53d0>

Sum-Mixture Model ¶

When using sum-mixture model, it is clear that we can't convert it into a nice least square problem. In other word, the logarithm can no longer be pushed all the way into the individual Gaussian components - the summation in prevents it. $$ x^* = \underset{x}{\operatorname{argmax}} \sum_i\sum_k{log(w_k^i\eta_k^i e^{-\frac{1}{2}(x-\mu_k^i)^T{\Sigma^i_{k^*}}^{-1}(x-\mu_k^i)})} $$

In [153]:

# Use Sum mixture model to reject outliers. Optimization is done by grid search.

import matplotlib.pyplot as plt
import numpy as np
from numpy.random import default_rng
from scipy.stats import norm

npoints = 100
a = 1
x = np.linspace(-3, 3, npoints)
y = a * x 
y_hat = y + np.random.normal(0, 0.1, x.shape)

n_outliers = 30
rng = np.random.default_rng()
outlier_indices = rng.choice(npoints, size=n_outliers, replace=False)
# In most practical applications, residuals are not Gaussian distributed.
y_hat[outlier_indices] = np.random.uniform(-3, 3, n_outliers)


# Suppose residuals follow Gaussian mixture distribution and we have two mixtures.
# Both mixtures have the same mean as 0 but different sigmas.
w1 = 0.6
s1 = 0.1
w2 = 1-w1
s2 = 1

a_hat = np.linspace(-1, 3, 1000)
losses = []
for t in a_hat:
    residuals = t*x-y_hat
    # TODO: We should check log-likelihood to avoid numerical issue.
    p1s = w1 * norm.pdf(residuals, loc=0, scale=s1)
    p2s = w2 * norm.pdf(residuals, loc=0, scale=s2)
    loss = 0
    for p1, p2, residual in zip(p1s, p2s, residuals):
        loss -= np.log(p1 + p2)
    loss /= x.shape[0]
    losses.append(loss)

a_best = a_hat[np.argmin(losses)]
print(f'Best $a$ with least loss is {a_best:.2f}')

fig, (ax1, ax2) = plt.subplots(
    nrows=2, ncols=1, figsize=(6,6), dpi=100)

ax1.plot(x, y, '-b', label='ground truth line y = x')
ax1.plot(x, a_best*x, '-g', label=f'predicted line y = {a_best:.2f}x')
ax1.plot(x, y_hat, 'r.', label='noisy observations')
ax1.plot(x[outlier_indices], y_hat[outlier_indices], 'k*', label='outliers')
ax1.set_ylabel('y')
ax1.set_xlabel('x')
ax1.legend()

ax2.plot(a_hat, losses, '-g', label='loss function')
ax2.set_ylabel('squared sum loss')
ax2.set_xlabel('slope a')
ax2.legend()

#

Best $a$ with least loss is 0.99

Out[153]:

<matplotlib.legend.Legend at 0x149022280>

Rosen et al. presented a theorem that: for a general factor potential function $f(x)$, e.g. Gaussian or GMM, suppose we have $f_i < 0$, $\|f_i\|_\infty < \infty$, $c_i > \|f_i\|_\infty$, then maximizing the likelihood is the same as minimizing a least square function. $$ \begin{align} \underset{x}{\operatorname{argmax}}f(x) =& \underset{x}{\operatorname{argmax}}f(x) \prod f_i(x) \\ =& \underset{x}{\operatorname{argmin}} \|r_i(x)\|^2 \end{align} $$ where $r_i(x) = \sqrt{log(c_i) - log(f_i(x))}$.

Taking the GMM example, we can think of $f_i(x) = \sum_k\pi_kN(\mu_k, \Sigma_k)$. Then a possible selection $c_i$ is $\sum_k^K{\frac{\pi_k}{\sqrt{(2\pi)^l\|\Sigma_k\|}}}$ where $\mu_k \in \mathbb{R}^l$. This is because the exponential part of a Gaussian is always smaller than $1$. So now we can formulate maximizing log likelihood as a least square problem again.

$$ \begin{align} x^* =& \underset{x}{\operatorname{argmax}} \sum_i\sum_k{log(w_k^i\eta_k^i e^{-\frac{1}{2}(x-\mu_k^i)^T{\Sigma^i_{k^*}}^{-1}(x-\mu_k^i)})} \\ =& \underset{x}{\operatorname{argmin}} \sum_i\sum_k\|\sqrt{log(c^i_k) - log(f^i_k(x))}\|^2 \end{align} $$

In [157]:

# Use Sum mixture model to reject outliers. Optimization is done as it is a least square problem.

import matplotlib.pyplot as plt
import numpy as np
from numpy.random import default_rng
from scipy.stats import norm
import math

npoints = 100
a = 1
x = np.linspace(-3, 3, npoints)
y = a * x 
y_hat = y + np.random.normal(0, 0.1, x.shape)

n_outliers = 30
rng = np.random.default_rng()
outlier_indices = rng.choice(npoints, size=n_outliers, replace=False)
# In most practical applications, residuals are not Gaussian distributed.
y_hat[outlier_indices] = np.random.uniform(-3, 3, n_outliers)

# Suppose residuals follow Gaussian mixture distribution and we have two mixtures.
# Both mixtures have the same mean as 0 but different sigmas.
w1 = 0.6
s1 = 0.1
w2 = 1-w1
s2 = 1

a_hat = np.linspace(-1, 3, 1000)
losses = []
for t in a_hat:
    residuals = t*x-y_hat
    # TODO: We should check log-likelihood to avoid numerical issue.
    p1s = w1 * norm.pdf(residuals, loc=0, scale=s1)
    p2s = w2 * norm.pdf(residuals, loc=0, scale=s2)
    c = w1 / np.sqrt(2*math.pi*s1*s1) + w2 / np.sqrt(2*math.pi*s2*s2)
    loss = 0
    for p1, p2, residual in zip(p1s, p2s, residuals):
        # Compare to previous section, the cost surface is just shifted
        loss += (np.sqrt(np.log(c) - np.log(p1 + p2))) ** 2
    loss /= x.shape[0]
    losses.append(loss)

a_best = a_hat[np.argmin(losses)]
print(f'Best $a$ with least loss is {a_best:.2f}')

fig, (ax1, ax2) = plt.subplots(
    nrows=2, ncols=1, figsize=(6,6), dpi=100)

ax1.plot(x, y, '-b', label='ground truth line y = x')
ax1.plot(x, a_best*x, '-g', label=f'predicted line y = {a_best:.2f}x')
ax1.plot(x, y_hat, 'r.', label='noisy observations')
ax1.plot(x[outlier_indices], y_hat[outlier_indices], 'k*', label='outliers')
ax1.set_ylabel('y')
ax1.set_xlabel('x')
ax1.legend()

ax2.plot(a_hat, losses, '-g', label='loss function')
ax2.set_ylabel('squared sum loss')
ax2.set_xlabel('slope a')
ax2.legend()

#

Best $a$ with least loss is 1.00

Out[157]:

<matplotlib.legend.Legend at 0x149397f10>

General Mixture Model based Least Square Optimization ¶

Now, we will try to formulate the GMM model based potentials into a least square optimization problem so that we can easily feed the cost function into solvers like Ceres.

Recall that we would like to minimize the least square problem as below: $$ \begin{align} x^* =& \underset{x}{\operatorname{argmin}} \sum_i{\|r_i(x_i, z_i)\|_{\Sigma}}^2 \\ =& \underset{x}{\operatorname{argmin}} \sum_i{\|\mathcal{I}^\frac{1}{2}r_i(x_i, z_i)\|^2} \end{align} $$

Then assume that residuals follow GMM: $$ p(r) \sim \sum_l^Ls_l e^{e_l(r)} $$ where $s_l = w_l \|\mathcal{I}^\frac{1}{2}\|$, $w_l = \frac{\pi_l}{\sqrt(2\pi)^?}$ $e_l(r) = -\frac{1}{2}\|\mathcal{I}_l^\frac{1}{2}(r-\mu_l)\|^2$.

Then the log-likelihood is: $$ \begin{align} l =& log(p(r)) \\ =& log(\sum_l^L s_l e^{e_l}) \\ =& log(\sum_l^L \frac{ s_l e^{e_l}}{e^{e_k}}e^{e_k}) \\ =& e_k + log(s_l e^{e_l - e_k}) \\ =& l(s, e_k, \tilde{e}) \end{align} $$ where $k = \underset{l}{\operatorname{argmax}}s_le^{e_l}$. Note this is similar to the max-mixture idea which only keeps the linear part $e_k$.

Recall the conclusion from previous section that $$ \begin{align} x^* =& \underset{x}{\operatorname{argmin}} \sum_i\|\sqrt{log(c^i_k) - log(f^i_k(x))}\|^2 \end{align} $$

Here we can set a $c$ that \begin{align} x^* =& \underset{x}{\operatorname{argmin}} \sum_i\|\sqrt{log(c^i_k) - l(s^i, e_k^i, \tilde{e}^i)}\|^2 \\ =& \underset{x}{\operatorname{argmin}} \sum_i\|\sqrt{-l(\frac{s^i}{c^i}, e_k^i, \tilde{e}^i)}\|^2 \end{align} where $c^i = \sum^L s_l^i$ or $c^i = L \underset{l}{\operatorname{max}}s_l^i + \delta$, or ...

I thought the $c$ only shifts cost surface so it wouldn't matter a lot for maximizing log-likelihood. But some experiments found it matters. Pfeifer et al. also point out that pulling the linear part of residual make the convergence better. However, I am wondering what the covariance mean as it has one extra dimension?

A summary of residual function:

Sum-Max-Mixture
- $r = \begin{bmatrix} \sqrt{-e_k} \\ \sqrt{-l(\frac{s}{c}, e_k, \tilde{e})}\end{bmatrix}$, $c = L \underset{l}{\operatorname{max}}s_l + \delta$
Sum-Mixture
- $r = \sqrt{-e_k-l(\frac{s}{c}, e_k, \tilde{e})}$, $c = \sum^L s_l$
Max-Mixture variation
- $r = \begin{bmatrix} \sqrt{-e_k} \\ \sqrt{-log(\frac{s_k}{c})}\end{bmatrix}$, $c = \underset{l}{\operatorname{max}}s_l + \delta$
Max-Mixture variation
- $r = \sqrt{-e_k -log(\frac{s_k}{c})}$, $c = \underset{l}{\operatorname{max}}s_l + \delta$
Max-Mixture
- $r = \sqrt{-e_k}$

Robust Kernel with Gaussian Mixture ¶

A mixed flavor of all things. I am not sure whether this makes sense mathematically. But if we think robust kernel is just adding extra weights to the original least square problem. Then it feels that this could also work.

In [159]:

# Use Sum mixture model to reject outliers. Optimization is done as it is a least square problem.

import matplotlib.pyplot as plt
import numpy as np
from numpy.random import default_rng
from scipy.stats import norm
import math

def huber_loss(r, k):
    y = np.zeros(r.shape)
    y[np.abs(r) < k] = 0.5 * (r[np.abs(r) < k] ** 2)
    y[np.abs(r) >= k] = k * (np.abs(r[np.abs(r) >= k]) - 0.5*k)
    return y

npoints = 100
a = 1
x = np.linspace(-3, 3, npoints)
y = a * x 
y_hat = y + np.random.normal(0, 0.1, x.shape)

n_outliers = 30
rng = np.random.default_rng()
outlier_indices = rng.choice(npoints, size=n_outliers, replace=False)
# In most practical applications, residuals are not Gaussian distributed.
y_hat[outlier_indices] = np.random.uniform(-3, 3, n_outliers)

# Suppose residuals follow Gaussian mixture distribution and we have two mixtures.
# Both mixtures have the same mean as 0 but different sigmas.
w1 = 0.6
s1 = 0.1
w2 = 1-w1
s2 = 1

a_hat = np.linspace(-1, 3, 1000)
losses = []
for t in a_hat:
    residuals = t*x-y_hat
    # TODO: We should check log-likelihood to avoid numerical issue.
    p1s = w1 * norm.pdf(residuals, loc=0, scale=s1)
    p2s = w2 * norm.pdf(residuals, loc=0, scale=s2)
    c = w1 / np.sqrt(2*math.pi*s1*s1) + w2 / np.sqrt(2*math.pi*s2*s2)
    loss = 0
    for p1, p2, residual in zip(p1s, p2s, residuals):
        # Compare to previous section, the cost surface is just shifted
        loss += huber_loss(np.sqrt(np.log(c) - np.log(p1 + p2)), 1)
    loss /= x.shape[0]
    losses.append(loss)

a_best = a_hat[np.argmin(losses)]
print(f'Best $a$ with least loss is {a_best:.2f}')

fig, (ax1, ax2) = plt.subplots(
    nrows=2, ncols=1, figsize=(6,6), dpi=100)

ax1.plot(x, y, '-b', label='ground truth line y = x')
ax1.plot(x, a_best*x, '-g', label=f'predicted line y = {a_best:.2f}x')
ax1.plot(x, y_hat, 'r.', label='noisy observations')
ax1.plot(x[outlier_indices], y_hat[outlier_indices], 'k*', label='outliers')
ax1.set_ylabel('y')
ax1.set_xlabel('x')
ax1.legend()

ax2.plot(a_hat, losses, '-g', label='loss function')
ax2.set_ylabel('squared sum loss')
ax2.set_xlabel('slope a')
ax2.legend()

#

Best $a$ with least loss is 1.00

Out[159]:

<matplotlib.legend.Legend at 0x14952f3a0>

Expectation Maximization for GMM based Factor Graph ¶

All GMM parameters shown in the previous section are not dynamically tuned. These parameters could be learned from offline to better represent the residual distributions, though in reality they are tuned by hand. EM algorithm is guaranteed to converge though the solution might be a local optimal. However, iterating between optimizing state $x$ and GMM parameters $\theta$ is not guaranteed to converge. So it'd better to use it offline.

So as described above, the algorithm will be iterating between optimizing state $x$ and GMM parameters $\theta$:

Step 1: Optimize state $x$ $$ x^* = \underset{x}{\operatorname{argmax}} \prod p (z_i | x, \theta) $$
Step 2: Optimize GMM parameter $\theta$ $$ \begin{align} r_{nk} =& \frac{\pi_{Z_k}N(r(x_n,z_n) | \mu_{Z_k}, \Sigma_{Z_k})}{\sum_{Z_k}\pi_{Z_k}N(r(x_n,z_n) | \mu_{Z_k}, \Sigma_{Z_k})} \\ \pi_k =& \frac{\sum_n^N r_{nk}}{N} \\ \mu_k =& \frac{\sum_n^N r_{nk}r(x_n,z_n)}{\sum_n^N r_{nk}} \\ \Sigma_k =&\frac{\sum_n^N r_{nk}(r(x_n,z_n)- \mu_k) (r(x_n,z_n)- \mu_k)^T}{\sum_n^N r_{nk}} \\ \end{align} $$ where $r(x_n,z_n)$ is the residual.

Personally, I don't see the EM estimated parameters make huge improvement for mixture model. It often runs into numerical problems (probably due to my naive implementation).

In [262]:

# Use Sum mixture model to reject outliers and GMM is tunned using EM.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from numpy.random import default_rng
from scipy.stats import norm
import math

mu1_gt = 0
mu2_gt = 2
w1_gt = 0.7
s1_gt = 0.5
w2_gt = 0.3
s2_gt = 3

# Suppose residuals follow Gaussian mixture distribution and we have two mixtures.
mu1 = 0
mu2 = 0
w1 = 0.5
s1 = 1
w2 = 0.5
s2 = 2

npoints = 1000
a = 1
x = np.linspace(-3, 3, npoints)
y = a * x 
y_hat = y  + np.random.normal(mu1_gt, s1_gt, x.shape)
n_outliers = int(0.3 * npoints)
rng = np.random.default_rng()
outlier_indices = rng.choice(npoints, size=n_outliers, replace=False)
# In most practical applications, residuals are not Gaussian distributed.
y_hat[outlier_indices] = y[outlier_indices] + np.random.normal(mu2_gt , s2_gt, n_outliers) 
# y_hat[outlier_indices] = np.random.uniform(-3, 3, n_outliers)

# Sum mixture losses
a_hat = np.linspace(-1, 3, 1000)
losses = []
for t in a_hat:
    residuals = t*x-y_hat
    # TODO: We should check log-likelihood to avoid numerical issue.
    p1s = w1 * norm.pdf(residuals, loc=mu1, scale=s1)
    p2s = w2 * norm.pdf(residuals, loc=mu2, scale=s2)
    c = w1 / np.sqrt(2*math.pi*s1*s1) + w2 / np.sqrt(2*math.pi*s2*s2)
    loss = 0
    for p1, p2, residual in zip(p1s, p2s, residuals):
        # Compare to previous section, the cost surface is just shifted
        loss += (np.sqrt(np.log(c) - np.log(p1 + p2))) ** 2
    loss /= x.shape[0]
    losses.append(loss)

a_best = a_hat[np.argmin(losses)]
print(f'Best $a$ with least loss is {a_best:.2f}')

fig, (ax1, ax2, ax3, ax4) = plt.subplots(
    nrows=4, ncols=1, figsize=(6,6), dpi=100)

ax1.plot(x, y, '-b', label='ground truth line y = x')
ax1.plot(x, a_best*x, '-g', label=f'predicted line y = {a_best:.2f}x')
ax1.plot(x, y_hat, 'r.', label='noisy observations')
ax1.plot(x[outlier_indices], y_hat[outlier_indices], 'k*', label='outliers')
ax1.set_ylabel('y')
ax1.set_xlabel('x')
ax1.legend()
ax1.set_title('Before EM-based GMM')

ax2.plot(a_hat, losses, '-g', label='loss function')
ax2.set_ylabel('squared sum loss')
ax2.set_xlabel('slope a')
ax2.legend()
ax2.set_title('Before EM-based GMM')

# EM for GMM
pi = [w1, w2]
mu = [mu1, mu2]
sigma = [s1, s2]
print(f'Inital parameters: \n pi:{pi}\n mu:{mu} \nsigma:{sigma} ')
lbs = []
samples = y_hat - a_best * x
assert(pi[0] + pi[1] == 1)
lbs = [] # lower bounds E_q + H(q)

def get_rnk(sample, pi, mu, sigma):
    l = np.array([np.log(pi_) - np.log(np.sqrt(2*math.pi)*sigma_) - 0.5 * (samples[n] - mu_)**2 / (sigma_**2) 
                  for pi_, mu_, sigma_ in zip(pi, mu, sigma)])
    l_ = l - np.max(l) # np.min(l) # use min or max?
    likelihood = np.exp(l_)
    r_nk = np.squeeze(likelihood / np.sum(likelihood))
    lb = np.sum(r_nk.reshape(-1,1) * l.reshape(-1,1) - (r_nk*np.log(r_nk)).reshape(-1,1)) # E_q + H(q)
    # lb = r_nk[0]*l[0]+r_nk[1]*l[1]-r_nk[1]*np.log(r_nk[1])-r_nk[0]*np.log(r_nk[0])
    return r_nk, lb

maxiter = 10  
for i in range(maxiter):
    new_pi = np.array([0., 0.])
    new_mu = np.array([0., 0.])
    new_var = np.array([0., 0.])
    lb_sum = 0
    for n in range(N):
        # Solve numerical issue when likelihood is very very small.
        r_nk, lb = get_rnk(samples[n], pi, mu, sigma)
        new_pi += r_nk
        new_mu += r_nk * samples[n]
        new_var += r_nk * samples[n] * samples[n]
        lb_sum += lb
    lbs.append(lb_sum)
    #print(f'iter: {i}, lower bounds: {b_sum.item():.2f}, pi: {pi}, mu: {mu}, sigma: {sigma}')
    print(f'iter: {i}, lower bounds: {lb_sum.item():.2f}')
    if len(lbs) > 1 and np.abs(lbs[-2] - lbs[-1]) < 1E-8:
        print('It converged...')
        break
    rk = new_pi
    pi = rk / N
    mu = new_mu / rk
    sigma = np.sqrt(new_var/rk - mu * mu)
    
# Re-derive the loss based on optimized GMM parameters.
# TODO: Put the code into a function.
print(f'Final parameters: \n pi:{pi}\n mu:{mu} \nsigma:{sigma} ')
mu1 = mu[0]
mu2 = mu[1]
w1 = pi[0]
s1 = sigma[0]
w2 = pi[1]
s2 = sigma[1]

losses = []
for t in a_hat:
    residuals = t*x-y_hat
    # TODO: We should check log-likelihood to avoid numerical issue.
    p1s = w1 * norm.pdf(residuals, loc=mu1, scale=s1)
    p2s = w2 * norm.pdf(residuals, loc=mu2, scale=s2)
    c = w1 / np.sqrt(2*math.pi*s1*s1) + w2 / np.sqrt(2*math.pi*s2*s2)
    loss = 0
    for p1, p2, residual in zip(p1s, p2s, residuals):
        # Compare to previous section, the cost surface is just shifted
        loss += (np.sqrt(np.log(c) - np.log(p1 + p2))) ** 2
    loss /= x.shape[0]
    losses.append(loss)
    
a_best = a_hat[np.argmin(losses)]
print(f'Best $a$ with least loss is {a_best:.2f}')

ax3.plot(x, y, '-b', label='ground truth line y = x')
ax3.plot(x, a_best*x, '-g', label=f'predicted line y = {a_best:.2f}x')
ax3.plot(x, y_hat, 'r.', label='noisy observations')
ax3.plot(x[outlier_indices], y_hat[outlier_indices], 'k*', label='outliers')
ax3.set_ylabel('y')
ax3.set_xlabel('x')
ax3.legend()
ax3.set_title('After EM-based GMM')

ax4.plot(a_hat, losses, '-g', label='loss function')
ax4.set_ylabel('squared sum loss')
ax4.set_xlabel('slope a')
ax4.legend()
ax4.set_title('After EM-based GMM')

#

Best $a$ with least loss is 0.98
Inital parameters: 
 pi:[0.5, 0.5]
 mu:[0, 0] 
sigma:[1, 2] 
iter: 0, lower bounds: -1996.09
iter: 1, lower bounds: -1735.90
iter: 2, lower bounds: -1667.49
iter: 3, lower bounds: -1655.19
iter: 4, lower bounds: -1653.62
iter: 5, lower bounds: -1653.34
iter: 6, lower bounds: -1653.24
iter: 7, lower bounds: -1653.20
iter: 8, lower bounds: -1653.18
iter: 9, lower bounds: -1653.17
Final parameters: 
 pi:[0.69314581 0.30685419]
 mu:[1.06846129e-03 1.94773296e+00] 
sigma:[0.51123151 3.05574671] 
Best $a$ with least loss is 1.01

Out[262]:

Text(0.5, 1.0, 'After EM-based GMM')

Self-supervise Learned Robust Kernel ¶

What do we want the network learn here? I think we hope it can learn the true distribution of residuals. It could be a arbitrary distribution.

References:

Olson, Inference on networks of mixtures for robust robot mapping
Engel, Large-Scale Direct SLAM and 3D Reconstruction in Real-Time
Barron, A General and Adaptive Robust Loss Function
Chebrolu, Adaptive Robust Kernels for Non-Linear Least Squares Problems
Pfeifer, https://www.tu-chemnitz.de/etit/proaut/en/research/libRSF.html

This blog is converted from paper-reading-mixture-models-for-least-square-optimization.ipynb

Written on May 11, 2021