Expectation–Maximization Algorithm

Introduction

Suppose we observe data $\mathcal{X}$ generated from a model $p$ with parameters $\theta$. We would like to estimate the parameters by maximizing the log-likelihood: $$ \hat{\theta} = \underset{\theta}{\operatorname{argmax}}log(p(\mathcal{X} | \theta)) $$ If the model also involves hidden variables $Z$, then $$ p(\mathcal{X} | \theta) = \sum_z p(\mathcal{X},z | \theta) $$

Note that this marginal log-likelihood is in general non-concave, with many local maxima. However, if we could observe both $\mathcal{X}$ and $Z$, and $p(x_n,z_n|\theta)$ belongs to an exponential family (such as a Gaussian), then taking the log yields a concave, easy-to-maximize complete-data log-likelihood, and the optimization becomes much easier.

Expectation-Maximization (EM) is an iterative algorithm that exploits this structure: it alternates between inferring the hidden variables under the current parameters and re-estimating the parameters by maximizing the resulting expected complete-data log-likelihood.

A Lower Bound of the Log-Likelihood

Suppose we have a distribution $q(z)$ over the hidden variables $z$. Then we define $L(q, \theta)$ as below:

$$ \begin{align} L(q, \theta) =& \sum_z q(z)log(\frac{p(X,z | \theta)}{q(z)}) \\ =& \sum_z q(z)log(p(X,z | \theta)) - \sum_z q(z)log(q(z)) \\ =& E_{q(z)}[log(p(X,Z | \theta))] + H(q) \end{align} $$

Then we calculate the KL divergence between $q(z)$ and $p(Z|X,\theta)$,

$$ \begin{align} D_{KL}(q, p(Z|X, \theta)) =& H(q, p(Z|X, \theta)) - H(q) \\ =& E_q[-log(\frac{p(X,Z | \theta)}{p(X | \theta)})] - H(q) \\ =& -E_q[log(p(X,Z | \theta))] + log(p(X | \theta)) - H(q) \\ =& l(X |\theta) - L(q, \theta) \end{align} $$

where the third line uses the fact that $p(X|\theta)$ does not depend on $z$, and $l(X|\theta) = log(p(X|\theta))$ denotes the observed-data log-likelihood.

Because $D_{KL}(q, p(Z|X, \theta)) \geq 0$, $L(q, \theta)$ is a lower bound of the log-likelihood $l(X|\theta)$.

If we want the lower bound $L(q, \theta)$ to be tight for the current $\theta$, we need to set $D_{KL}(q, p(Z|X, \theta)) = 0$, which is achieved by picking $q(z)$ as: $$ q_{\theta}(z) = p(Z | X, \theta) $$

Expectation-Maximization Algorithm

Expectation-Maximization is the iterative process below: at each step we first take the expectation with respect to $q_{\theta_{t}}(z)$, then maximize it with respect to $\theta$.

$$ \theta_{t+1} = \underset{\theta}{\operatorname{argmax}}E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))] $$

In some of the literature, this update is broken down into two steps:

  1. E-step: find the expectation $E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))]$ where $q_{\theta_t} = \underset{q}{\operatorname{argmax}}L(q, \theta_t)$. This is solved by minimizing the KL divergence $D_{KL}(q, p(Z|X, \theta_t))$, which gives $q_{\theta_t}(z) = p(Z | X, \theta_t)$.

  2. M-step: solve the maximization $\underset{\theta}{\operatorname{argmax}}E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))]$ (a generic sketch of the whole loop follows the note below).

!!! Note !!!: For EM to work well, there are two requirements:

  1. The hidden posterior $p(Z|X,\theta)$ is tractable, so that we can easily take expectations with respect to $q_{\theta_t}(z) = p(Z | X, \theta_t)$.

  2. $p(Z, X|\theta)$ belongs to an exponential family, so that the expected complete-data log-likelihood is concave in $\theta$ and the maximization step is easy to solve (often in closed form).
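
Putting the steps together, a generic EM loop looks roughly like the sketch below. This is a minimal illustration, not a fixed API: `e_step`, `m_step`, and `lower_bound` are hypothetical callables that a concrete model has to supply (the coin-flip and GMM examples later fill them in).

```python
def run_em(e_step, m_step, lower_bound, theta0, max_iter=100, tol=1e-6):
    """Generic EM loop (sketch).

    e_step(theta)         -> q, a representation of p(Z | X, theta)
    m_step(q)             -> argmax over theta of E_q[log p(X, Z | theta)]
    lower_bound(q, theta) -> L(q, theta), used only to monitor convergence
    """
    theta, prev = theta0, float("-inf")
    for _ in range(max_iter):
        q = e_step(theta)           # E-step: q_{theta_t}(z) = p(Z | X, theta_t)
        theta = m_step(q)           # M-step: maximize the expected complete-data log-likelihood
        bound = lower_bound(q, theta)
        if bound - prev < tol:      # the bound is non-decreasing, so stop when it plateaus
            break
        prev = bound
    return theta
```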

Proof of Convergence

Because EM alternates a tightening step and a maximization step, the log-likelihood never decreases:

$$ \begin{align} l(X|\theta_t) =& L(q_{\theta_t}, \theta_t) && \text{since $D_{KL}(q_{\theta_t}, p(Z|X, \theta_t)) = 0$}\\ \leq & L(q_{\theta_t}, \theta_{t+1}) && \text{since $\theta_{t+1}$ maximizes $L(q_{\theta_t}, \theta)$ over $\theta$}\\ \leq & l(X|\theta_{t+1}) && \text{since $L(q, \theta)$ is a lower bound of $l(X|\theta)$} \end{align} $$

Examples

Coin Flip

Karl Rosaen's blog post gives a very intuitive explanation of how EM works for a coin-flip experiment. Here we will follow the math more strictly.

Suppose we have two coins with different head probabilities $\theta_A$ and $\theta_B$, both unknown to us. We collect data from a series of $N$ trials to estimate the bias of each coin. Each trial $k$ consists of flipping the same randomly chosen coin $Z_k$ a total of $M$ times and recording only the total number $X_k$ of heads.

Mathematically,

$$ \theta = (\theta_A, \theta_B) \\ Z_k \sim Uniform(\{A, B\}) \\ X_k | Z_k, \theta \sim Binomial(M, \theta_{Z_k}) $$
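
As a concrete reference, here is a minimal numpy sketch of this generative process. The function name `simulate_coin_flips`, the trial counts, and the seed are illustrative choices; only the ground-truth values $\theta_A = 0.1$, $\theta_B = 0.7$ come from the run recorded later in this post.

```python
import numpy as np

def simulate_coin_flips(theta_A, theta_B, N, M, seed=0):
    """Draw N trials: pick coin A or B uniformly, flip it M times, record the head count."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=N)              # 0 -> coin A, 1 -> coin B (hidden)
    theta = np.where(z == 0, theta_A, theta_B)  # head probability of the chosen coin
    x = rng.binomial(M, theta)                  # observed head counts X_k
    return x, z

x, z_true = simulate_coin_flips(theta_A=0.1, theta_B=0.7, N=100, M=20)
```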

When we want to tackle a problem using EM, we first need to think about two probability distributions, $p(Z | X, \theta)$ and $p(Z, X | \theta)$.

1. For $p(Z, X |\theta)$:

$$ p(Z, X |\theta) = p(X|Z,\theta) p(Z|\theta) = p(X|Z,\theta) p(Z) $$

Then we can get: $$ \begin{align} p(Z=A, X=X_k |\theta) =& p(X=X_k|Z=A,\theta) p(Z=A) \\ =& \frac{1}{2}\begin{pmatrix}M\\X_k\end{pmatrix}\theta_A^{X_k}(1-\theta_A)^{M-X_k} \\ p(Z=B, X=X_k |\theta) =& p(X=X_k|Z=B,\theta) p(Z=B) \\ =& \frac{1}{2}\begin{pmatrix}M\\X_k\end{pmatrix}\theta_B^{X_k}(1-\theta_B)^{M-X_k} \end{align} $$

2. For $p(Z | X, \theta)$:

$$ q_k(Z=A) = p(Z = A | X=X_k, \theta) = \frac{p(Z=A|\theta)\, p(X=X_k | Z=A, \theta)}{p(X=X_k|\theta)} \\ q_k(Z=B) = p(Z = B | X=X_k, \theta) = \frac{p(Z=B|\theta)\, p(X=X_k | Z=B, \theta)}{p(X=X_k|\theta)} $$

where $p(Z=A|\theta) = p(Z=B|\theta) = \frac{1}{2}$, $p(X=X_k | Z=A, \theta) = \begin{pmatrix}M\\X_k\end{pmatrix}\theta_A^{X_k}(1-\theta_A)^{M-X_k}$, and $p(X=X_k | Z=B, \theta) = \begin{pmatrix}M\\X_k\end{pmatrix}\theta_B^{X_k}(1-\theta_B)^{M-X_k}$.

Next we run the EM algorithm iteratively:

$$ \theta_{t+1} = \underset{\theta}{\operatorname{argmax}}E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))] $$

For a single trial $k$ we get (the second line drops additive terms that do not depend on $\theta$):

$$ \begin{align} E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))] =& q_k(Z=A)log(p(X=X_k, Z=A | \theta)) + q_k(Z=B) log(p(X=X_k, Z=B | \theta)) \\ \propto& q_k(Z=A)[X_klog(\theta_A)+(M-X_k)log(1-\theta_A)] + q_k(Z=B)[X_klog(\theta_B)+(M-X_k)log(1-\theta_B)] \end{align} $$

Since we have $N$ trials:

$$ \begin{align} E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))] =& E_{q_{\theta_t}}[log\prod_n^N{p(X=x_n, Z=z_n|\theta)}]\\ =& \sum_{n}^{N}E_{q_{\theta_t}}[log(p(X=x_n, Z=z_n|\theta)] \end{align} $$

To maximize $E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))]$, we will take the derivative with respect to $\theta_A$ and $\theta_B$, and set them to be $0$.
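
Concretely, for $\theta_A$ (the $\theta_B$ case is symmetric), the stationarity condition reads

$$ \frac{\partial}{\partial \theta_A}\sum_n^N q_n(Z=A)\left[x_n log(\theta_A)+(M-x_n)log(1-\theta_A)\right] = \sum_n^N q_n(Z=A)\left[\frac{x_n}{\theta_A} - \frac{M-x_n}{1-\theta_A}\right] = 0 $$

Multiplying through by $\theta_A(1-\theta_A)$ and solving gives the updates: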

$$ \theta_A = \frac{\sum_n^N{a_n x_n}}{\sum_n^N{a_n M}} \\ \theta_B = \frac{\sum_n^N{b_n x_n}}{\sum_n^N{b_n M}} $$

where $a_n = q_n(Z=A)$, $b_n = q_n(Z=B)$, $a_n+b_n=1$
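
Below is a minimal numpy/scipy sketch of these updates, reusing `x` from the simulation sketch above. The initial guesses, the tolerance, and the lack of lower-bound bookkeeping are illustrative simplifications, not the original notebook's implementation.

```python
from scipy.stats import binom

def coin_em(x, M, theta_A=0.6, theta_B=0.5, max_iter=100, tol=1e-4):
    """EM for the two-coin model (sketch); x holds the head count of each trial."""
    for _ in range(max_iter):
        # E-step: responsibilities a_n = q_n(Z=A), b_n = q_n(Z=B) under the current parameters
        pa = 0.5 * binom.pmf(x, M, theta_A)
        pb = 0.5 * binom.pmf(x, M, theta_B)
        a = pa / (pa + pb)
        b = 1.0 - a
        # M-step: the closed-form updates derived above
        new_A = (a * x).sum() / (a.sum() * M)
        new_B = (b * x).sum() / (b.sum() * M)
        converged = max(abs(new_A - theta_A), abs(new_B - theta_B)) < tol
        theta_A, theta_B = new_A, new_B
        if converged:
            break
    return theta_A, theta_B

theta_A_hat, theta_B_hat = coin_em(x, M=20)
```

The original notebook's own run of this procedure printed: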

GT theta_A: 0.1, GT theta_B: 0.7
iter: 0, theta_A: 0.60, theta_B: 0.70, lower bounds: -2831.16
iter: 1, theta_A: 0.21, theta_B: 0.71, lower bounds: -452.67
iter: 2, theta_A: 0.10, theta_B: 0.70, lower bounds: -265.30
It converged...
t_A: 0.10, t_B: 0.70

Gaussian Mixture Model

A Gaussian mixture model can be defined as:

$$ p(x) = \sum_k^K\pi_k N(x | \mu_k, \Sigma_k) $$

To use a GMM for clustering, the full model can be defined as

$$ \theta = (\pi, \mu, \Sigma) \\ Z_k \sim Categorical(\pi) \\ X_k | Z_k, \theta \sim N(\mu_{Z_k}, \Sigma_{Z_k}) $$
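
As with the coin example, here is a minimal numpy sketch of sampling from this generative model. The particular mixture (three 2-D components, and their weights, means, and covariances) is an illustrative choice, not taken from the original notebook.

```python
import numpy as np

def simulate_gmm(pi, mus, Sigmas, N, seed=0):
    """Sample N points: draw a component from Categorical(pi), then a Gaussian draw."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=N, p=pi)  # hidden component labels
    X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return X, z

pi_true = [0.5, 0.3, 0.2]
mus_true = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
Sigmas_true = [0.5 * np.eye(2)] * 3
X, z_true = simulate_gmm(pi_true, mus_true, Sigmas_true, N=1000)
```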

When we want to tackle a problem using EM, we first need to think about two probability distributions, $p(Z | X, \theta)$ and $p(Z, X | \theta)$.

1. For $p(Z, X |\theta)$:

$$ \begin{align} p(Z_k, X_k | \theta) =& p(X_k|Z_k,\theta) p(Z_k|\theta) \\ =& \pi_{Z_k}N(X_k | \mu_{Z_k}, \Sigma_{Z_k}) \end{align} $$

2. For $p(Z | X, \theta)$:

$$ \begin{align} p(Z_k | X_k, \theta) =& \frac{p(Z_k|\theta)\, p(X_k | Z_k, \theta)}{p(X_k|\theta)} \\ =& \frac{\pi_{Z_k}N(X_k | \mu_{Z_k}, \Sigma_{Z_k})}{\sum_{z}\pi_{z}N(X_k | \mu_{z}, \Sigma_{z})} \end{align} $$

Next we run the EM algorithm iteratively:

$$ \theta_{t+1} = \underset{\theta}{\operatorname{argmax}}E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))] $$

For a single iteration with $N$ data points, we get: $$ \begin{align} E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))] =& \sum_n^N \sum_k^K p(z_n = k | x_n, \theta_t) log(\pi_kN(x_n | \mu_k, \Sigma_k)) \\ =& \sum_n^N \sum_k^K r_{nk} log(\pi_k) + \sum_n^N \sum_k^K r_{nk} log(N(x_n | \mu_k, \Sigma_k)) \end{align} $$ where $r_{nk} = p(z_n = k | x_n, \theta_t)$

By maximizing $E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))]$, we get $$ \begin{align} \pi_k =& \frac{\sum_n^N r_{nk}}{N} \\ \mu_k =& \frac{\sum_n^N r_{nk}x_n}{\sum_n^N r_{nk}} \\ \Sigma_k =&\frac{\sum_n^N r_{nk}(x_n- \mu_k) (x_n- \mu_k)^T}{\sum_n^N r_{nk}} \\ =&\frac{\sum_n^N r_{nk}x_n x_n^T}{\sum_n^N r_{nk}} - \mu_k \mu_k^T \end{align} $$
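
A minimal numpy/scipy sketch of these updates is below, reusing `X` from the simulation sketch above. The random initialization, the small ridge added to the covariances, and the stopping rule are illustrative choices rather than the original notebook's implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, max_iter=100, tol=1e-6, seed=0):
    """EM for a Gaussian mixture (sketch). Returns (pi, mus, Sigmas, responsibilities)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]       # initialize means at random data points
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)
    prev_bound = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities r_nk = p(z_n = k | x_n, theta_t)
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                         for k in range(K)], axis=1)     # shape (N, K)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: the closed-form updates derived above
        Nk = r.sum(axis=0)                               # effective number of points per component
        pi = Nk / N
        mus = (r.T @ X) / Nk[:, None]
        Sigmas = np.stack([
            ((r[:, k:k + 1] * (X - mus[k])).T @ (X - mus[k])) / Nk[k] + 1e-6 * np.eye(D)
            for k in range(K)])                          # small ridge keeps covariances invertible
        # log p(X | theta_t), which equals the lower bound right after the E-step
        bound = np.log(dens.sum(axis=1)).sum()
        if bound - prev_bound < tol:
            break
        prev_bound = bound
    return pi, mus, Sigmas, r

pi_hat, mus_hat, Sigmas_hat, r = gmm_em(X, K=3)
```

The original notebook's run printed: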

iter: 0, lower bounds: -23637.87
iter: 1, lower bounds: -18410.67
iter: 2, lower bounds: -18346.46
iter: 3, lower bounds: -18346.46
iter: 4, lower bounds: -18346.46
It converged...

Summary

All you need to do is find the two probability distributions below:

$$ p(Z_k, X_k |\theta) = p(X_k|Z_k,\theta) p(Z_k|\theta) $$

$$ \begin{align} p(Z_k | X_k, \theta) =& \frac{p(Z_k|\theta)\, p(X_k | Z_k, \theta)}{p(X_k|\theta)} \\ =& \frac{p(Z_k, X_k |\theta)}{p(X_k|\theta)} \\ =& \frac{p(Z_k, X_k |\theta)}{\sum_{Z_k}p(Z_k, X_k |\theta)} \end{align} $$

Then iteratively calculate:

$$ \theta_{t+1} = \underset{\theta}{\operatorname{argmax}}E_{q_{\theta_{t}}(z)}[log(p(X, Z|\theta))] $$

Reference:

  • Benjamin Bray, Expectation Maximization
This blog is converted from expectation–maximization.ipynb
Written on May 10, 2021