Expectation–Maximization Algorithm

Introduction

Suppose we observe data $\mathcal{X}$ generated from a model $p$ with parameters $\theta$. We would like to estimate the parameters by maximizing the log-likelihood: $$ \hat{\theta} = \underset{\theta}{\operatorname{argmax}}log(p(\mathcal{X} | \theta)) $$ If the model also involves hidden variables $Z$, then $$ p(\mathcal{X} | \theta) = \sum_z p(\mathcal{X},z | \theta) $$

Note that this marginal log-likelihood is in general non-concave, with many local maxima. However, if we could observe both $\mathcal{X}$ and $Z$, and $p(x_n,z_n|\theta)$ belongs to an exponential family (such as a Gaussian), then taking the log yields a concave, easy-to-maximize complete-data log-likelihood, and the optimization becomes much easier.

Expectation-Maximization (EM) is an iterative algorithm that exploits this structure: it alternates between inferring the hidden variables under the current parameters and re-estimating the parameters by maximizing the resulting expected complete-data log-likelihood.

A Lower Bound of the Log-Likelihood

Suppose we have a distribution $q(z)$ over the hidden variables $z$. Then we define $L(q, \theta)$ as below:

$$ \begin{align} L(q, \theta) =& \sum_z q(z)log(\frac{p(X,z | \theta)}{q(z)}) \\ =& \sum_z q(z)log(p(X,z | \theta)) - \sum_z q(z)log(q(z)) \\ =& E_{q(z)}[log(p(X,Z | \theta))] + H(q) \end{align} $$

Then we calculate the KL divergence between $q(z)$ and $p(Z|X,\theta)$,

$$ \begin{align} D_{KL}(q, p(Z|X, \theta)) =& H(q, p(Z|X, \theta)) - H(q) \\ =& E_q[-log(\frac{p(X,Z | \theta)}{p(X | \theta)})] - H(q) \\ =& -E_q[log(p(X,Z | \theta))] + log(p(X | \theta)) - H(q) \\ =& l(X |\theta) - L(q, \theta) \end{align} $$

where the third line uses the fact that $p(X|\theta)$ does not depend on $z$, and $l(X|\theta) = log(p(X|\theta))$ denotes the observed-data log-likelihood.

Because $D_{KL}(q, p(Z|X, \theta)) \geq 0$, $L(q, \theta)$ is a lower bound of the log-likelihood $l(X|\theta)$.

If we want the lower bound $L(q, \theta)$ to be tight for the current $\theta$, we need to set $D_{KL}(q, p(Z|X, \theta)) = 0$, which is achieved by picking $q(z)$ as: $$ q_{\theta}(z) = p(Z | X, \theta) $$

Expectation-Maximization Algorithm

Expectation-Maximization is the iterative process below: at each step we first take the expectation with respect to $q_{\theta_{t}}(z)$, then maximize it with respect to $\theta$.

$$ \theta_{t+1} = \underset{\theta}{\operatorname{argmax}}E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))] $$

In some of the literature, this update is broken down into two steps:

  1. E-step: find the expectation $E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))]$ where $q_{\theta_t} = \underset{q}{\operatorname{argmax}}L(q, \theta_t)$. This is solved by minimizing the KL divergence $D_{KL}(q, p(Z|X, \theta_t))$, which gives $q_{\theta_t}(z) = p(Z | X, \theta_t)$.

  2. M-step: solve the maximization $\underset{\theta}{\operatorname{argmax}}E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))]$ (a generic sketch of the whole loop follows the note below).

!!! Note !!!: For EM to work well, there are two requirements:

  1. The hidden posterior $p(Z|X,\theta)$ is tractable, so that we can easily take expectations with respect to $q_{\theta_t}(z) = p(Z | X, \theta_t)$.

  2. $p(Z, X|\theta)$ belongs to an exponential family, so that the expected complete-data log-likelihood is concave in $\theta$ and the maximization step is easy to solve (often in closed form).
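
Putting the steps together, a generic EM loop looks roughly like the sketch below. This is a minimal illustration, not a fixed API: `e_step`, `m_step`, and `lower_bound` are hypothetical callables that a concrete model has to supply (the coin-flip and GMM examples later fill them in).

```python
def run_em(e_step, m_step, lower_bound, theta0, max_iter=100, tol=1e-6):
    """Generic EM loop (sketch).

    e_step(theta)         -> q, a representation of p(Z | X, theta)
    m_step(q)             -> argmax over theta of E_q[log p(X, Z | theta)]
    lower_bound(q, theta) -> L(q, theta), used only to monitor convergence
    """
    theta, prev = theta0, float("-inf")
    for _ in range(max_iter):
        q = e_step(theta)           # E-step: q_{theta_t}(z) = p(Z | X, theta_t)
        theta = m_step(q)           # M-step: maximize the expected complete-data log-likelihood
        bound = lower_bound(q, theta)
        if bound - prev < tol:      # the bound is non-decreasing, so stop when it plateaus
            break
        prev = bound
    return theta
```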

Proof of Convergence

Because EM alternates a tightening step and a maximization step, the log-likelihood never decreases:

$$ \begin{align} l(X|\theta_t) =& L(q_{\theta_t}, \theta_t) && \text{since $D_{KL}(q_{\theta_t}, p(Z|X, \theta_t)) = 0$}\\ \leq & L(q_{\theta_t}, \theta_{t+1}) && \text{since $\theta_{t+1}$ maximizes $L(q_{\theta_t}, \theta)$ over $\theta$}\\ \leq & l(X|\theta_{t+1}) && \text{since $L(q, \theta)$ is a lower bound of $l(X|\theta)$} \end{align} $$

Examples

Coin Flip

Karl Rosaen's blog post gives a very intuitive explanation of how EM works for a coin-flip experiment. Here we will follow the math more strictly.

Suppose we have two coins with different head probabilities $\theta_A$ and $\theta_B$, both unknown to us. We collect data from a series of $N$ trials to estimate the bias of each coin. Each trial $k$ consists of flipping the same randomly chosen coin $Z_k$ a total of $M$ times and recording only the total number $X_k$ of heads.

Mathematically,

$$ \theta = (\theta_A, \theta_B) \\ Z_k \sim Uniform(\{A, B\}) \\ X_k | Z_k, \theta \sim Binomial(M, \theta_{Z_k}) $$
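
As a concrete reference, here is a minimal numpy sketch of this generative process. The function name `simulate_coin_flips`, the trial counts, and the seed are illustrative choices; only the ground-truth values $\theta_A = 0.1$, $\theta_B = 0.7$ come from the run recorded later in this post.

```python
import numpy as np

def simulate_coin_flips(theta_A, theta_B, N, M, seed=0):
    """Draw N trials: pick coin A or B uniformly, flip it M times, record the head count."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=N)              # 0 -> coin A, 1 -> coin B (hidden)
    theta = np.where(z == 0, theta_A, theta_B)  # head probability of the chosen coin
    x = rng.binomial(M, theta)                  # observed head counts X_k
    return x, z

x, z_true = simulate_coin_flips(theta_A=0.1, theta_B=0.7, N=100, M=20)
```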

When we want to tackle a problem using EM, we first need to think about two probability distributions, $p(Z | X, \theta)$ and $p(Z, X | \theta)$.

1. For $p(Z, X |\theta)$:

$$ p(Z, X |\theta) = p(X|Z,\theta) p(Z|\theta) = p(X|Z,\theta) p(Z) $$

Then we can get: $$ \begin{align} p(Z=A, X=X_k |\theta) =& p(X=X_k|Z=A,\theta) p(Z=A) \\ =& \frac{1}{2}\begin{pmatrix}M\\X_k\end{pmatrix}\theta_A^{X_k}(1-\theta_A)^{M-X_k} \\ p(Z=B, X=X_k |\theta) =& p(X=X_k|Z=B,\theta) p(Z=B) \\ =& \frac{1}{2}\begin{pmatrix}M\\X_k\end{pmatrix}\theta_B^{X_k}(1-\theta_B)^{M-X_k} \end{align} $$

2. For $p(Z | X, \theta)$:

$$ q_k(Z=A) = p(Z = A | X=X_k, \theta) = \frac{p(Z=A|\theta)\, p(X=X_k | Z=A, \theta)}{p(X=X_k|\theta)} \\ q_k(Z=B) = p(Z = B | X=X_k, \theta) = \frac{p(Z=B|\theta)\, p(X=X_k | Z=B, \theta)}{p(X=X_k|\theta)} $$

where $p(Z=A|\theta) = p(Z=B|\theta) = \frac{1}{2}$, $p(X=X_k | Z=A, \theta) = \begin{pmatrix}M\\X_k\end{pmatrix}\theta_A^{X_k}(1-\theta_A)^{M-X_k}$, and $p(X=X_k | Z=B, \theta) = \begin{pmatrix}M\\X_k\end{pmatrix}\theta_B^{X_k}(1-\theta_B)^{M-X_k}$.

Next we run the EM algorithm iteratively:

$$ \theta_{t+1} = \underset{\theta}{\operatorname{argmax}}E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))] $$

For a single trial $k$ we get (the second line drops additive terms that do not depend on $\theta$):

$$ \begin{align} E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))] =& q_k(Z=A)log(p(X=X_k, Z=A | \theta)) + q_k(Z=B) log(p(X=X_k, Z=B | \theta)) \\ \propto& q_k(Z=A)[X_klog(\theta_A)+(M-X_k)log(1-\theta_A)] + q_k(Z=B)[X_klog(\theta_B)+(M-X_k)log(1-\theta_B)] \end{align} $$

Since we have $N$ trials:

$$ \begin{align} E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))] =& E_{q_{\theta_t}}[log\prod_n^N{p(X=x_n, Z=z_n|\theta)}]\\ =& \sum_{n}^{N}E_{q_{\theta_t}}[log(p(X=x_n, Z=z_n|\theta)] \end{align} $$

To maximize $E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))]$, we will take the derivative with respect to $\theta_A$ and $\theta_B$, and set them to be $0$.
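
Concretely, for $\theta_A$ (the $\theta_B$ case is symmetric), the stationarity condition reads

$$ \frac{\partial}{\partial \theta_A}\sum_n^N q_n(Z=A)\left[x_n log(\theta_A)+(M-x_n)log(1-\theta_A)\right] = \sum_n^N q_n(Z=A)\left[\frac{x_n}{\theta_A} - \frac{M-x_n}{1-\theta_A}\right] = 0 $$

Multiplying through by $\theta_A(1-\theta_A)$ and solving gives the updates: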

$$ \theta_A = \frac{\sum_n^N{a_n x_n}}{\sum_n^N{a_n M}} \\ \theta_B = \frac{\sum_n^N{b_n x_n}}{\sum_n^N{b_n M}} $$

where $a_n = q_n(Z=A)$, $b_n = q_n(Z=B)$, $a_n+b_n=1$
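
Below is a minimal numpy/scipy sketch of these updates, reusing `x` from the simulation sketch above. The initial guesses, the tolerance, and the lack of lower-bound bookkeeping are illustrative simplifications, not the original notebook's implementation.

```python
from scipy.stats import binom

def coin_em(x, M, theta_A=0.6, theta_B=0.5, max_iter=100, tol=1e-4):
    """EM for the two-coin model (sketch); x holds the head count of each trial."""
    for _ in range(max_iter):
        # E-step: responsibilities a_n = q_n(Z=A), b_n = q_n(Z=B) under the current parameters
        pa = 0.5 * binom.pmf(x, M, theta_A)
        pb = 0.5 * binom.pmf(x, M, theta_B)
        a = pa / (pa + pb)
        b = 1.0 - a
        # M-step: the closed-form updates derived above
        new_A = (a * x).sum() / (a.sum() * M)
        new_B = (b * x).sum() / (b.sum() * M)
        converged = max(abs(new_A - theta_A), abs(new_B - theta_B)) < tol
        theta_A, theta_B = new_A, new_B
        if converged:
            break
    return theta_A, theta_B

theta_A_hat, theta_B_hat = coin_em(x, M=20)
```

The original notebook's own run of this procedure printed: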

GT theta_A: 0.1, GT theta_B: 0.7
iter: 0, theta_A: 0.60, theta_B: 0.70, lower bounds: -2831.16
iter: 1, theta_A: 0.21, theta_B: 0.71, lower bounds: -452.67
iter: 2, theta_A: 0.10, theta_B: 0.70, lower bounds: -265.30
It converged...
t_A: 0.10, t_B: 0.70

Gaussian Mixture Model

A Gaussian mixture model can be defined as:

$$ p(x) = \sum_k^K\pi_k N(x | \mu_k, \Sigma_k) $$

To use a GMM for clustering, the full model can be defined as

$$ \theta = (\pi, \mu, \Sigma) \\ Z_k \sim Categorical(\pi) \\ X_k | Z_k, \theta \sim N(\mu_{Z_k}, \Sigma_{Z_k}) $$
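
As with the coin example, here is a minimal numpy sketch of sampling from this generative model. The particular mixture (three 2-D components, and their weights, means, and covariances) is an illustrative choice, not taken from the original notebook.

```python
import numpy as np

def simulate_gmm(pi, mus, Sigmas, N, seed=0):
    """Sample N points: draw a component from Categorical(pi), then a Gaussian draw."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=N, p=pi)  # hidden component labels
    X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return X, z

pi_true = [0.5, 0.3, 0.2]
mus_true = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
Sigmas_true = [0.5 * np.eye(2)] * 3
X, z_true = simulate_gmm(pi_true, mus_true, Sigmas_true, N=1000)
```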

When we want to tackle a problem using EM, we first need to think about two probability distributions, $p(Z | X, \theta)$ and $p(Z, X | \theta)$.

1. For $p(Z, X |\theta)$:

$$ \begin{align} p(Z_k, X_k | \theta) =& p(X_k|Z_k,\theta) p(Z_k|\theta) \\ =& \pi_{Z_k}N(X_k | \mu_{Z_k}, \Sigma_{Z_k}) \end{align} $$

2. For $p(Z | X, \theta)$:

$$ \begin{align} p(Z_k | X_k, \theta) =& \frac{p(Z_k|\theta)\, p(X_k | Z_k, \theta)}{p(X_k|\theta)} \\ =& \frac{\pi_{Z_k}N(X_k | \mu_{Z_k}, \Sigma_{Z_k})}{\sum_{z}\pi_{z}N(X_k | \mu_{z}, \Sigma_{z})} \end{align} $$

Next we run the EM algorithm iteratively:

$$ \theta_{t+1} = \underset{\theta}{\operatorname{argmax}}E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))] $$

For a single iteration with $N$ data points, we get: $$ \begin{align} E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))] =& \sum_n^N \sum_k^K p(z_n = k | x_n, \theta_t) log(\pi_kN(x_n | \mu_k, \Sigma_k)) \\ =& \sum_n^N \sum_k^K r_{nk} log(\pi_k) + \sum_n^N \sum_k^K r_{nk} log(N(x_n | \mu_k, \Sigma_k)) \end{align} $$ where $r_{nk} = p(z_n = k | x_n, \theta_t)$

By maximizing $E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))]$, we get $$ \begin{align} \pi_k =& \frac{\sum_n^N r_{nk}}{N} \\ \mu_k =& \frac{\sum_n^N r_{nk}x_n}{\sum_n^N r_{nk}} \\ \Sigma_k =&\frac{\sum_n^N r_{nk}(x_n- \mu_k) (x_n- \mu_k)^T}{\sum_n^N r_{nk}} \\ =&\frac{\sum_n^N r_{nk}x_n x_n^T}{\sum_n^N r_{nk}} - \mu_k \mu_k^T \end{align} $$
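
A minimal numpy/scipy sketch of these updates is below, reusing `X` from the simulation sketch above. The random initialization, the small ridge added to the covariances, and the stopping rule are illustrative choices rather than the original notebook's implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, max_iter=100, tol=1e-6, seed=0):
    """EM for a Gaussian mixture (sketch). Returns (pi, mus, Sigmas, responsibilities)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]       # initialize means at random data points
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)
    prev_bound = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities r_nk = p(z_n = k | x_n, theta_t)
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                         for k in range(K)], axis=1)     # shape (N, K)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: the closed-form updates derived above
        Nk = r.sum(axis=0)                               # effective number of points per component
        pi = Nk / N
        mus = (r.T @ X) / Nk[:, None]
        Sigmas = np.stack([
            ((r[:, k:k + 1] * (X - mus[k])).T @ (X - mus[k])) / Nk[k] + 1e-6 * np.eye(D)
            for k in range(K)])                          # small ridge keeps covariances invertible
        # log p(X | theta_t), which equals the lower bound right after the E-step
        bound = np.log(dens.sum(axis=1)).sum()
        if bound - prev_bound < tol:
            break
        prev_bound = bound
    return pi, mus, Sigmas, r

pi_hat, mus_hat, Sigmas_hat, r = gmm_em(X, K=3)
```

The original notebook's run printed: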

iter: 0, lower bounds: -23637.87
iter: 1, lower bounds: -18410.67
iter: 2, lower bounds: -18346.46
iter: 3, lower bounds: -18346.46
iter: 4, lower bounds: -18346.46
It converged...

Summary

All you need to do is find the two probability distributions below:

$$ p(Z_k, X_k |\theta) = p(X_k|Z_k,\theta) p(Z_k|\theta) $$

$$ \begin{align} p(Z_k | X_k, \theta) =& \frac{p(Z_k|\theta)\, p(X_k | Z_k, \theta)}{p(X_k|\theta)} \\ =& \frac{p(Z_k, X_k |\theta)}{p(X_k|\theta)} \\ =& \frac{p(Z_k, X_k |\theta)}{\sum_{Z_k}p(Z_k, X_k |\theta)} \end{align} $$

Then iteratively calculate:

$$ \theta_{t+1} = \underset{\theta}{\operatorname{argmax}}E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))] $$

Reference:

  • Benjamin Bray, Expectation Maximization
This blog is converted from expectation–maximization.ipynb
Written on May 10, 2021