Mixture Models for Least Square Optimization

Problem Formulation

The goal is to infer the posterior distribution of the state variable $x$ conditioned on sensor observations z $$ x^* = \underset{x}{\operatorname{argmax}} p (x | z) $$ By applying Bayes' rule and assuming uninformative prior $p(x)$, maximizing posterior distribution is same as maximizing likelihood distribution: $$ x^* = \underset{x}{\operatorname{argmax}} p (x | z) = x^* = \underset{x}{\operatorname{argmax}} p (z | x) $$ If we assume all observations $z$ are independent given some related state variable $x$, we can get: $$ x^* = \underset{x}{\operatorname{argmax}} \prod p (z_i | x) $$

Factor Potentials as Gaussian

If we assume $p(z_i | x) \sim N (f_i(x)-z_i, \Sigma_i)$, then we can maximize the log-likelihood as: $$ x^* = \underset{x}{\operatorname{argmax}} \sum_i{log(\eta_i e^{-\frac{1}{2}(f_i(x)-z_i)^T\Sigma_i^{-1}(f_i(x)-z_i)})} $$ It then turns into a least square problem: $$ x^* = \underset{x}{\operatorname{argmin}} \sum_i{(f_i(x)-z_i)^T\Sigma_i^{-1}(f_i(x)-z_i)} = \underset{x}{\operatorname{argmin}} \sum_i{\|f_i(x) - z_i\|_{\Sigma_i}} $$ For the sack of simplicity, we assume $f_i$ are linear functions (We can always linearize it as $f_i(x) \approx J_i(x-x_0) + f_x(x_0)$). Then this becomes a linear least square problem, and the cost function is quadratic and convex.

A Simple Linear Regression

Consider a very simple linear regression problem. In order to visualize the loss vs parameter in a 2D plot, we parameterize the line as $y = ax$. If we assume $\Sigma_i = I$, then $$ a^* = \sum_{i}(ax_i - y_i)^2 $$ As shown in the figure below, the cost function is clearly a quadratic function.

In [48]:
Best $a$ with least loss is 1.00
Out[48]:
<matplotlib.legend.Legend at 0x13799cfa0>

A Simple Linear Regression With Outliers

However, the approach described above is very sensitive to outliers. Let's add a couple of outliers to the measurements. As you can see, the estimation is off. This is because these outliers contribute more errors to the cost function. The optimizer tries to reduce these errors.

In [49]:
Best $a$ with least loss is 0.73
Out[49]:
<matplotlib.legend.Legend at 0x137a51e50>

Robust Kernels

How do we make the optimization more robust to outliers?

  1. Detect the outliers before running optimization so that we never use them in the optimization.
  2. Give less weights for outliers in the optimization. Outliers usually have large errors (residual) compared to normal data assume that the initial guesses of the state variables are not very off.

For (2), we can replace the $L_2$ norm with robust kernels (M-estimators): $$ x^* = \underset{x}{\operatorname{argmin}} \sum_i{\rho(r_i(x))} $$ where $r_i(x) = |f_i(x) - z_i|{\Sigma_i}^{-\frac{1}{2}}$, $\rho()$ is a function designed to reduce the influence of large residuals.

Huber Loss

One common robust kernel is Huber loss: $$ \rho(r) = \begin{cases} \frac{r^2}{2} &\text{$|r| < k$}\\ k(|r|-\frac{k}{2}) &\text{$|r| \geq k$} \end{cases} $$

$$ \frac{\partial L}{\partial x} = \sum_i{\frac{\partial \rho(r_i)}{\partial r_i}\frac{\partial r_i(x)}{\partial x_i}} $$

We define $\phi = \frac{\partial \rho(r_i)}{\partial r_i}$, then $$ \frac{\partial L}{\partial x} = \sum_i{\frac{\phi}{r_i}r_i\frac{\partial r_i(x)}{\partial x_i}} $$

We define $w = \frac{\phi}{r_i}$, then it turns out that we can formulate loss L as: $$ L = \sum_i{w(r_i(x_c))r^2_i(x)} $$ where, $x_c$ is the current most recent estimate of $x$, aka linearization point.

For huber loss, the weight function is: $$ w(r) = \begin{cases} 1 &\text{$|r| < k$}\\ \frac{k}{|r|} &\text{$|r| \geq k$} \end{cases} $$

By setting the weight function as $w = \frac{1}{r_i}\frac{\partial \rho(r_i)}{\partial r_i} |_{r_i = r_i(x_c)}$, we can solve the the optimization problem by using the existing techniques for weighted least-squares such as Gauss-Newton, Levenberg-Marquardt, etc.

In [50]:
Out[50]:
<matplotlib.legend.Legend at 0x1376028b0>

Now, we can apply Huber loss to the linear regression problem.

In [51]:
Best $a$ with least loss is 1.00
Out[51]:
<matplotlib.legend.Legend at 0x137bda580>

A Generalized Robust Kernel

A single generalized robust kernel can be express as:

$$ \rho(r, \alpha, k) = \frac{|\alpha-2|}{\alpha}((\frac{(\frac{|r|}{k})^2}{|\alpha-2|}+1)^{\frac{\alpha}{2}}-1) $$

where $\alpha$ is a real-valued parameter that controls the shape of the kernel and $k > 0$ is the scale parameter that determines the size of quadratic loss region around $r = 0$.

With removing singularities:

$$ \rho(r, \alpha, k) = \begin{cases} \frac{1}{2}(\frac{|r|}{k})^2 &\text{$\alpha = 2$}\\ log(\frac{1}{2}(\frac{|r|}{k})^2+1) &\text{$\alpha = 0$}\\ 1 - e^{-\frac{1}{2}(\frac{|r|}{k})^2} &\text{$\alpha = -\infty$}\\ \frac{|\alpha-2|}{\alpha}((\frac{(\frac{|r|}{k})^2}{|\alpha-2|}+1)^{\frac{\alpha}{2}}-1) &\text{otherwise} \end{cases} $$

Then weight function $w$ is: $$ w(r, \alpha, k) = \begin{cases} \frac{1}{k^2} &\text{$\alpha = 2$}\\ \frac{2}{r^2+2k^2} &\text{$\alpha = 0$}\\ \frac{1}{k^2} e^{-\frac{1}{2}(\frac{|r|}{k})^2} &\text{$\alpha = -\infty$}\\ \frac{1}{k^2}(\frac{(\frac{|r|}{k})^2}{|\alpha-2|}+1)^{\frac{\alpha}{2}-1} &\text{otherwise} \end{cases} $$

In [56]:
Out[56]:
<matplotlib.legend.Legend at 0x1445a3d00>

Construct Probability Distributions from Robust Kernels

Recall that we are trying to maximize the log-likelihood: $$ x^* = \underset{x}{\operatorname{argmax}} \sum_i log(p(z_i|x)) $$ In this section, we are trying to minimize the robust cost: $$ x^* = \underset{x}{\operatorname{argmin}} \sum_i{\rho(r_i(x))} $$

By connecting these two together, we can get: $$ p(z_i | x) = p(r_i(x, z_i)) = \frac{1}{kZ(\alpha)}e^{-\rho(r_i, \alpha, k)} $$ where $Z(\alpha) = \int{e^{-\rho(r_i, \alpha, 1)}}$

When optimizing hyperparameter $\alpha$ and state variable $x$ together, there are a couple of things to note:

  1. $Z(\alpha)$ is unbounded for $\alpha < 0$. So a solution will be cut off integral to be in $[-\tau, \tau]$. Then, the probability distribution is not well normalized. But for maximizing log-likelihood purpose, it should work fine.
  2. $x^*, \alpha^* = \underset{x, \alpha}{\operatorname{argmax}} \sum_i log(p_{\alpha}(z_i|x)) = \underset{x, \alpha}{\operatorname{argmin}}\sum_i\rho(r_i,\alpha,k) + log(kZ(\alpha))$. Be aware that the final cost function has a regularization term.

Factor Potentials as Gaussian Mixtures

In the previous section, we assume the likelihood $p(z|x)$ follows Gaussian distribution. As we pointed out earlier, in practice, residuals sometime are not Gaussian distributed. This is exactly why we come up with robust kernels. However, using these robust kernels still assumes that residuals follows a symmetric distribution. Now let's approach the problem in a different way which assumes that residuals follow a Gaussian mixture model:

$$ p(z_i | x) \sim \sum_kw^i_kN(\mu^i_k, \Sigma^i_k) $$

As shown in the figure below, the mixture model (red) is not symmetric anymore. It can represent arbitrary distribution given proper components. But this also brings a difficulty in optimization. Remember we are trying to maximize log-likelihood and we can convert it into a least square problem with the assumption that residuals follow a single Gaussian distribution. But without Gaussian mixture, we can we do? $$ x^* = \underset{x}{\operatorname{argmax}} \prod p (z_i | x) = \underset{x}{\operatorname{argmax}} \sum_i log(\sum_kw^i_kN(\mu^i_k, \Sigma^i_k)) $$

In [55]:
Out[55]:
<matplotlib.legend.Legend at 0x144403760>

Max-Mixture Model

In order to convert the maximization of log-likelihood with Gaussian mixture into a nice least square problem, we try to use the max component in the mixtures to approximate it:

$$ p(z_i | x) \sim \sum_kw^i_kN(\mu^i_k, \Sigma^i_k) \sim \frac{\underset{k}{\operatorname{max}}\sum_kw^i_kN(\mu^i_k, \Sigma^i_k)}{\gamma^i} $$

Note that a normalization factor $\gamma$ is required in order to ensure that a max mixture integrates to $1$. However, for maximizing log-likelihood, it is just added as a constant ($-log(\gamma)$). So it won't affect the final solution of $x^*$:

$$ x^* = \underset{x}{\operatorname{argmax}} \sum_i{log(\frac{w^i_{k^*}}{\gamma^i}\eta_i e^{-\frac{1}{2}(\mu^i_{k^*})^T{\Sigma^i_{k^*}}^{-1}(\mu^i_{k^*})})} = \underset{x}{\operatorname{argmax}} \sum_i{log(e^{-\frac{1}{2}(\mu^i_{k^*})^T{\Sigma^i_{k^*}}^{-1}(\mu^i_{k^*})})} $$

I general max-mixture is a very good approximation to sum-mixture. I think a good practice is that when you define mixture model of the residuals, check how close the max-mixture matches the sum-mixture.

In [86]:
Out[86]:
<matplotlib.legend.Legend at 0x145e67a60>
In [160]:
Best $a$ with least loss is 1.00
Out[160]:
<matplotlib.legend.Legend at 0x1495a53d0>

Sum-Mixture Model

When using sum-mixture model, it is clear that we can't convert it into a nice least square problem. In other word, the logarithm can no longer be pushed all the way into the individual Gaussian components - the summation in prevents it. $$ x^* = \underset{x}{\operatorname{argmax}} \sum_i\sum_k{log(w_k^i\eta_k^i e^{-\frac{1}{2}(x-\mu_k^i)^T{\Sigma^i_{k^*}}^{-1}(x-\mu_k^i)})} $$

In [153]:
Best $a$ with least loss is 0.99
Out[153]:
<matplotlib.legend.Legend at 0x149022280>

Rosen et al. presented a theorem that: for a general factor potential function $f(x)$, e.g. Gaussian or GMM, suppose we have $f_i < 0$, $\|f_i\|_\infty < \infty$, $c_i > \|f_i\|_\infty$, then maximizing the likelihood is the same as minimizing a least square function. $$ \begin{align} \underset{x}{\operatorname{argmax}}f(x) =& \underset{x}{\operatorname{argmax}}f(x) \prod f_i(x) \\ =& \underset{x}{\operatorname{argmin}} \|r_i(x)\|^2 \end{align} $$ where $r_i(x) = \sqrt{log(c_i) - log(f_i(x))}$.

Taking the GMM example, we can think of $f_i(x) = \sum_k\pi_kN(\mu_k, \Sigma_k)$. Then a possible selection $c_i$ is $\sum_k^K{\frac{\pi_k}{\sqrt{(2\pi)^l\|\Sigma_k\|}}}$ where $\mu_k \in \mathbb{R}^l$. This is because the exponential part of a Gaussian is always smaller than $1$. So now we can formulate maximizing log likelihood as a least square problem again.

$$ \begin{align} x^* =& \underset{x}{\operatorname{argmax}} \sum_i\sum_k{log(w_k^i\eta_k^i e^{-\frac{1}{2}(x-\mu_k^i)^T{\Sigma^i_{k^*}}^{-1}(x-\mu_k^i)})} \\ =& \underset{x}{\operatorname{argmin}} \sum_i\sum_k\|\sqrt{log(c^i_k) - log(f^i_k(x))}\|^2 \end{align} $$
In [157]:
Best $a$ with least loss is 1.00
Out[157]:
<matplotlib.legend.Legend at 0x149397f10>

General Mixture Model based Least Square Optimization

Now, we will try to formulate the GMM model based potentials into a least square optimization problem so that we can easily feed the cost function into solvers like Ceres.

Recall that we would like to minimize the least square problem as below: $$ \begin{align} x^* =& \underset{x}{\operatorname{argmin}} \sum_i{\|r_i(x_i, z_i)\|_{\Sigma}}^2 \\ =& \underset{x}{\operatorname{argmin}} \sum_i{\|\mathcal{I}^\frac{1}{2}r_i(x_i, z_i)\|^2} \end{align} $$

Then assume that residuals follow GMM: $$ p(r) \sim \sum_l^Ls_l e^{e_l(r)} $$ where $s_l = w_l \|\mathcal{I}^\frac{1}{2}\|$, $w_l = \frac{\pi_l}{\sqrt(2\pi)^?}$ $e_l(r) = -\frac{1}{2}\|\mathcal{I}_l^\frac{1}{2}(r-\mu_l)\|^2$.

Then the log-likelihood is: $$ \begin{align} l =& log(p(r)) \\ =& log(\sum_l^L s_l e^{e_l}) \\ =& log(\sum_l^L \frac{ s_l e^{e_l}}{e^{e_k}}e^{e_k}) \\ =& e_k + log(s_l e^{e_l - e_k}) \\ =& l(s, e_k, \tilde{e}) \end{align} $$ where $k = \underset{l}{\operatorname{argmax}}s_le^{e_l}$. Note this is similar to the max-mixture idea which only keeps the linear part $e_k$.

Recall the conclusion from previous section that $$ \begin{align} x^* =& \underset{x}{\operatorname{argmin}} \sum_i\|\sqrt{log(c^i_k) - log(f^i_k(x))}\|^2 \end{align} $$

Here we can set a $c$ that \begin{align} x^* =& \underset{x}{\operatorname{argmin}} \sum_i\|\sqrt{log(c^i_k) - l(s^i, e_k^i, \tilde{e}^i)}\|^2 \\ =& \underset{x}{\operatorname{argmin}} \sum_i\|\sqrt{-l(\frac{s^i}{c^i}, e_k^i, \tilde{e}^i)}\|^2 \end{align} where $c^i = \sum^L s_l^i$ or $c^i = L \underset{l}{\operatorname{max}}s_l^i + \delta$, or ...

I thought the $c$ only shifts cost surface so it wouldn't matter a lot for maximizing log-likelihood. But some experiments found it matters. Pfeifer et al. also point out that pulling the linear part of residual make the convergence better. However, I am wondering what the covariance mean as it has one extra dimension?

A summary of residual function:

  • Sum-Max-Mixture

    • $r = \begin{bmatrix} \sqrt{-e_k} \\ \sqrt{-l(\frac{s}{c}, e_k, \tilde{e})}\end{bmatrix}$, $c = L \underset{l}{\operatorname{max}}s_l + \delta$
  • Sum-Mixture

    • $r = \sqrt{-e_k-l(\frac{s}{c}, e_k, \tilde{e})}$, $c = \sum^L s_l$
  • Max-Mixture variation

    • $r = \begin{bmatrix} \sqrt{-e_k} \\ \sqrt{-log(\frac{s_k}{c})}\end{bmatrix}$, $c = \underset{l}{\operatorname{max}}s_l + \delta$
  • Max-Mixture variation

    • $r = \sqrt{-e_k -log(\frac{s_k}{c})}$, $c = \underset{l}{\operatorname{max}}s_l + \delta$
  • Max-Mixture

    • $r = \sqrt{-e_k}$

Robust Kernel with Gaussian Mixture

A mixed flavor of all things. I am not sure whether this makes sense mathematically. But if we think robust kernel is just adding extra weights to the original least square problem. Then it feels that this could also work.

In [159]:
Best $a$ with least loss is 1.00
Out[159]:
<matplotlib.legend.Legend at 0x14952f3a0>

Expectation Maximization for GMM based Factor Graph

All GMM parameters shown in the previous section are not dynamically tuned. These parameters could be learned from offline to better represent the residual distributions, though in reality they are tuned by hand. EM algorithm is guaranteed to converge though the solution might be a local optimal. However, iterating between optimizing state $x$ and GMM parameters $\theta$ is not guaranteed to converge. So it'd better to use it offline.

So as described above, the algorithm will be iterating between optimizing state $x$ and GMM parameters $\theta$:

  • Step 1: Optimize state $x$ $$ x^* = \underset{x}{\operatorname{argmax}} \prod p (z_i | x, \theta) $$

  • Step 2: Optimize GMM parameter $\theta$ $$ \begin{align} r_{nk} =& \frac{\pi_{Z_k}N(r(x_n,z_n) | \mu_{Z_k}, \Sigma_{Z_k})}{\sum_{Z_k}\pi_{Z_k}N(r(x_n,z_n) | \mu_{Z_k}, \Sigma_{Z_k})} \\ \pi_k =& \frac{\sum_n^N r_{nk}}{N} \\ \mu_k =& \frac{\sum_n^N r_{nk}r(x_n,z_n)}{\sum_n^N r_{nk}} \\ \Sigma_k =&\frac{\sum_n^N r_{nk}(r(x_n,z_n)- \mu_k) (r(x_n,z_n)- \mu_k)^T}{\sum_n^N r_{nk}} \\ \end{align} $$ where $r(x_n,z_n)$ is the residual.

Personally, I don't see the EM estimated parameters make huge improvement for mixture model. It often runs into numerical problems (probably due to my naive implementation).

In [262]:
Best $a$ with least loss is 0.98
Inital parameters: 
 pi:[0.5, 0.5]
 mu:[0, 0] 
sigma:[1, 2] 
iter: 0, lower bounds: -1996.09
iter: 1, lower bounds: -1735.90
iter: 2, lower bounds: -1667.49
iter: 3, lower bounds: -1655.19
iter: 4, lower bounds: -1653.62
iter: 5, lower bounds: -1653.34
iter: 6, lower bounds: -1653.24
iter: 7, lower bounds: -1653.20
iter: 8, lower bounds: -1653.18
iter: 9, lower bounds: -1653.17
Final parameters: 
 pi:[0.69314581 0.30685419]
 mu:[1.06846129e-03 1.94773296e+00] 
sigma:[0.51123151 3.05574671] 
Best $a$ with least loss is 1.01
Out[262]:
Text(0.5, 1.0, 'After EM-based GMM')

Self-supervise Learned Robust Kernel

What do we want the network learn here? I think we hope it can learn the true distribution of residuals. It could be a arbitrary distribution.

References:

  • Olson, Inference on networks of mixtures for robust robot mapping
  • Engel, Large-Scale Direct SLAM and 3D Reconstruction in Real-Time
  • Barron, A General and Adaptive Robust Loss Function
  • Chebrolu, Adaptive Robust Kernels for Non-Linear Least Squares Problems
  • Pfeifer, https://www.tu-chemnitz.de/etit/proaut/en/research/libRSF.html
This blog is converted from paper-reading-mixture-models-for-least-square-optimization.ipynb
Written on May 11, 2021