Famed investigator Nick Carter got into trouble when solving the case of the mysterious disappearance of lady Thuns's dog. The only clue he was able to obtain was a set of photographs of the likely villain. Unfortunately, the villain is nearly indecipherable in the photographs. Nick is just a detective, so he needs your help to recover the image of the villain from the corrupted photos. He has watched too many episodes of CSI and thinks it can be done. Can it?
So the technically challenged Nick has $n$ corrupted images of $H \times W$ pixels (Figure 2). He knows that each image depicts the face of the villain and that the face occupies a subwindow of each image. Specifically, the image size of the face is only $H \times w$ pixels, i.e., the face spans the full height of the image but is narrower than the image. Thus only the horizontal position of the face varies between the images (i.e., the horizontal position $d$ is the only unknown). Nick also knows that the background of the image, the image area outside the face itself, has constant intensity $b$. See Figure 3. The images are very noisy (assume Gaussian noise here).
Figure 2. One of Nick's images
Figure 3. Image composition. The face is at an unknown horizontal position $d$.
You are given a set of $n$ noisy images, each of dimensions $H \times W$, such that each image contains the same hidden face. The face has a fixed width of $w = 36$ pixels and its height equals the image height $H$, but in each image it is translated by some unknown and different horizontal displacement. The challenge is to reconstruct the “most likely” image of the face from all the noisy views. “Most likely” can be interpreted directly as the maximum likelihood image of the face, as will be shown below.
Assume that the image set $X = \{x_1,x_2,\ldots,x_n\}$ (i.e., the images that contain the hidden face) was generated by the following model.
For easier notation in the following derivation, let us group the parameters we wish to estimate as $\theta = (f,b,\sigma)$. The likelihood $\Lambda_c$ (the subscript c stands for complete data, i.e., both the images and the displacements are known) of observing the image set $X =\{x_1,x_2,\ldots,x_n\}$ is \begin{align*} \Lambda_c(X,\{d_k\}_{k=1}^n;\theta) &= \prod_k P_{X,D}(x_k,d_k;\theta) \\ &= \prod_k P_{X \mid D}(x_k \mid d_k;\theta)P_{D}(d_k;\theta). \end{align*}

Note that in the supervised learning case, i.e., if we are given the images AND the locations of the face, we can directly maximize the log likelihood (taking the logarithm does not change the maximizer): \begin{align*} \theta^* &= \arg\max_{\theta} \Lambda_c(X,\{d_k\}_{k=1}^n;\theta) \\ &= \arg\max_{\theta} \prod_k P_{X \mid D}(x_k \mid d_k;\theta)P_{D}(d_k;\theta) \\ &= \arg\max_{\theta} \sum_k \log\left( P_{X \mid D}(x_k \mid d_k;\theta)P_{D}(d_k;\theta) \right) \\ &= \arg\max_{\theta} \lambda_c(X,\{d_k\}_{k=1}^n;\theta). \end{align*}

Since both the background and the face are corrupted by Gaussian noise, let us introduce a shorthand for the Gaussian density so that the forthcoming discussion has less cumbersome notation: $$ g(x;\mu,\sigma) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}. $$

Let us explicitly specify the distribution of pixel intensities of an image $x_k$ given that we know where the face is, i.e., $P_{X \mid D}(x_k \mid d_k;\theta)$. We are given that the background and the face are corrupted by Gaussian noise with the same variance $\sigma^2$. Let $x_k(i,j)$ and $f(i,j)$ be the intensities of the pixel at coordinate $(i,j)$ in the image and in the face, respectively. Then \begin{equation*} P_{X \mid D}(x_k \mid d_k;\theta) = \prod_{ij}\left\{ \begin{array}{ll} g(x_k(i,j);b,\sigma) & 1 \leq j < d_k \text{ or } d_k+w \leq j \leq W, \\ g(x_k(i,j);f(i,j-d_k+1),\sigma) & \text{otherwise.} \end{array} \right. \end{equation*}

$P_{D}(\cdot)$ is a probability mass function that encodes the prior belief that the face is located at a particular displacement. At the start of the investigation we have no information about where the face is, so initially we will randomly assign a probability to each possible displacement while ensuring that $P_{D}(\cdot)$ remains a valid probability mass function. It should be clear from lab 4 how to maximize the log likelihood $\lambda_c(X,\{d_k\}_{k=1}^n;\theta)$ in the supervised learning context. Unfortunately, we are not given the displacements of the face in any of the images.
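To make the pixel model concrete, here is a minimal MATLAB sketch that evaluates $\log P_{X \mid D}(x \mid d;\theta)$ for a single $H \times W$ image. The function name log_Pxd_single is hypothetical and the convention that s2 denotes the noise variance $\sigma^2$ is an assumption; the template's required interfaces take precedence.

% Illustrative helper (hypothetical name): log P(x | d) for one H-by-W image x,
% face f (H-by-w), background intensity b and noise variance s2.
function llh = log_Pxd_single(x, f, b, s2, d)
    [H, W] = size(x);
    w = size(f, 2);
    mu = repmat(b, H, W);                 % background mean everywhere ...
    mu(:, d:d+w-1) = f;                   % ... except under the face window
    % sum over all pixels of log N(x(i,j); mu(i,j), s2)
    llh = sum(sum(-0.5*log(2*pi*s2) - (x - mu).^2 ./ (2*s2)));
end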
In the unsupervised learning context it is still possible to find the likelihood of observing the image set $X = \{x_1,x_2,\ldots,x_n\}$ by treating $\{d_k\}_{k=1}^n$ as hidden variables and marginalizing over all possible displacements \begin{align*} \Lambda(X;\theta) &= \prod_k P_{X}(x_k;\theta) \\ &= \prod_k \sum_{d} P_{X,D}(x_k,d;\theta) \\ &= \prod_k \sum_{d} P_{X \mid D}(x_k \mid d;\theta)P_{D}(d;\theta). \end{align*} Remember that $d$ takes the integer values $\{1,\ldots,W-w+1\}$. The marginal log likelihood is \begin{align*} \log \Lambda(X;\theta) &= \sum_k \log \sum_{d}P_{X \mid D}(x_k \mid d;\theta)P_{D}(d;\theta) \\ &= \lambda(X;\theta). \end{align*} However, $\lambda(X;\theta)$ is difficult to maximize directly: the summation inside the logarithm couples the parameters $\theta$. EM instead iteratively maximizes a concave, easier-to-optimize lower bound on the marginal log likelihood $\lambda(X;\theta)$. A very good exposition and theoretical development of EM can be found in the EM notes of Dellaert (pdf) [3].
EM alternates between estimating the distributions over the hidden variables $\{d_k\}_{k=1}^n$ and the unknown parameter vector $\theta$; i.e., it is a block-coordinate ascent. This sounds similar to k-means; however, instead of committing to the single best $d_k$ given the current estimate $\theta$ at each iteration, EM computes a distribution over all possible displacements. Specifically, the E-step estimates the posteriors $P_{D \mid X}$ of the hidden variables given the current guess of the model parameters $\theta^{(t)}$, and the M-step maximizes the expected complete-data log likelihood, defined as \begin{align*} Q^{(t)}(\theta) &= \sum_{k,d} P_{D|X}(d \mid x_k;\theta^{(t)})\log P_{X,D}(x_k,d;\theta) \\ &= \sum_k \mathbb{E}_{D|X=x_k} \left[\log P_{X,D}(x_k,D;\theta)\right]. \end{align*}
$P_{D \mid X}(d \mid x_k;\theta^{(t)})$ is the probability that the villain's face $f$ is located at displacement $d$ given the image $x_k$. Given $\theta^{(t)}$, calculate \begin{align*} P_{D \mid X}(d \mid x_k;\theta^{(t)}) &= \frac{P_{X \mid D}(x_k \mid d;\theta^{(t)})P_{D}(d;\theta^{(t)})}{P_X(x_k;\theta^{(t)})} \\ &= \frac{P_{X \mid D}(x_k \mid d;\theta^{(t)})P_{D}(d;\theta^{(t)})}{\sum_d P_{X \mid D}(x_k \mid d;\theta^{(t)})P_D(d;\theta^{(t)})} \end{align*} for each image $k$ and each possible displacement $d \in \{1,\ldots,W-w+1\}$. Note that for any image $x_k$ $$ \sum_d P_{D \mid X}(d \mid x_k;\theta^{(t)}) = 1. $$ A good way to implement this is to store the probabilities $P_{D \mid X}$ in a 2D look-up table, e.g., define the Matlab array $$\text{Pdx(d,k)} = P_{D \mid X}(d \mid x_k;\theta^{(t)}).$$ This definition makes the Matlab array column stochastic. If $t=0$, there is no valid estimate $\theta^{(0)}$ yet, so randomly initialize $P_{D \mid X}(d \mid x_k;\theta^{(0)})$ while enforcing the constraint that $\text{Pdx(d,k)}$ is column stochastic. To prevent underflow, see the section Posterior Calculation. To calculate the prior on displacements, marginalize over the images (treating them as samples from the empirical distribution, i.e., $P_X(x_k)=\frac{1}{n}$) \begin{align*} P_D(d) &= \sum_k P_{X,D}(x_k,d) \\ &= \sum_k P_{D \mid X}(d \mid x_k;\theta^{(t)})P_X(x_k) \\ &= \frac{1}{n}\sum_k P_{D \mid X}(d \mid x_k;\theta^{(t)}). \end{align*}
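As an illustration, the random initialization of the posterior table and the prior update above can be sketched as follows; the variable names Pdx, Pd, W, w, n follow the text, and bsxfun is used only to keep the sketch version-agnostic:

% Random initialization of the posteriors at t = 0: columns must sum to one.
Pdx = rand(W - w + 1, n);
Pdx = bsxfun(@rdivide, Pdx, sum(Pdx, 1));   % make the table column stochastic
% Prior update: Pd(d) = (1/n) * sum_k Pdx(d,k)
Pd = mean(Pdx, 2);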
Given $P_{D \mid X}(d \mid x_k;\theta^{(t)})$, which was estimated in the E-step, the likelihood of the parameters of interest $\theta$ can be maximized by solving for $\theta$ such that the gradient $\nabla Q^{(t)}(\theta) = 0$ (and verifying that it is a maximum). Stated as an optimization criterion, the M-step is $$\theta^{(t+1)} = \arg\max_{\theta}Q^{(t)}(\theta).$$
The update equations for $\theta^{(t+1)}$, obtained from the constraint $\nabla Q^{(t)}(\theta) = 0$, are provided for you: \begin{align*} f(i,j) &=\frac{1}{n} \sum_{k=1}^{n} \sum_{d=1}^{W-w+1} P_{D|X}(d \mid x_k ; \theta^{(t)})\, x_k(i, d+j-1), \\ b &= \frac{1}{nH(W-w)}\sum_{k,d} P_{D \mid X}(d \mid x_k;\theta^{(t)}) \sum_i\sum_{j \notin \{ d,\ldots,d+w-1\} } x_k(i,j), \\ \sigma^2 &= \frac{1}{nHW}\sum_{k,d} P_{D \mid X}(d \mid x_k;\theta^{(t)})\left(\Delta f(k,d)+\Delta b(k,d)\right), \\ \end{align*} where \begin{align*} \Delta f(k,d) &= \sum_i\sum_{j \in \{1,\ldots,w\} } (x_k(i,j+d-1)-f(i,j))^2, \\ \Delta b(k,d) &= \sum_i\sum_{j \notin \{ d,\ldots,d+w-1\} } (x_k(i,j)-b)^2. \\ \end{align*} For the derivation of these formulas, see [5].
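A minimal sketch of these updates is shown below. It assumes the images are stacked in an H-by-W-by-n array X (the template may store them differently) and that Pdx(d,k) holds the E-step posteriors; all other variable names are illustrative. The freshly updated f and b are plugged into Δf and Δb when computing σ², which is consistent with the m_step example output in the hints below.

% Illustrative M-step, not the required m_step interface.
[H, W, n] = size(X);
nd = W - w + 1;                               % number of admissible displacements
f = zeros(H, w);  b = 0;  s2 = 0;
for k = 1:n                                   % face and background updates
    xk = X(:, :, k);
    for d = 1:nd
        p = Pdx(d, k);
        bg = true(1, W);  bg(d:d+w-1) = false;         % background columns
        f = f + (p / n) * xk(:, d:d+w-1);
        b = b + p * sum(sum(xk(:, bg)));
    end
end
b = b / (n * H * (W - w));
for k = 1:n                                   % noise variance, using the new f and b
    xk = X(:, :, k);
    for d = 1:nd
        p = Pdx(d, k);
        bg = true(1, W);  bg(d:d+w-1) = false;
        df = sum(sum((xk(:, d:d+w-1) - f).^2));         % Delta_f(k,d)
        db = sum(sum((xk(:, bg) - b).^2));              % Delta_b(k,d)
        s2 = s2 + p * (df + db);
    end
end
s2 = s2 / (n * H * W);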
We will stop the EM loop when the marginal log-likelihood we are maximizing stops changing, i.e., when its change between successive iterations falls below some tolerance. Calculating the marginal log-likelihood naively could result in underflow. A correct way of handling this issue is implemented for you in the calc_mllh.m function provided in the template. Review the code in calc_mllh.m to understand how to calculate the marginal log-likelihood correctly. The method used to prevent underflow is explained in the section logsumexp trick.
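For example, the stopping test inside the EM loop might look like the following sketch; tol is an illustrative choice and the calc_mllh argument list is an assumption, so use the interface from the template.

% Sketch of the EM stopping criterion; not the required run_EM structure.
tol = 1e-6;                          % illustrative tolerance
L_prev = -Inf;
done = false;
while ~done
    % ... E-step and M-step updates of Pdx, Pd, f, b, s2 go here ...
    L = calc_mllh(X, f, b, s2, Pd);  % assumed argument list; see the template
    done = abs(L - L_prev) < tol;    % stop when the log-likelihood stops changing
    L_prev = L;
end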
To fulfil this assignment, you need to submit these files (all packed in a single .zip file) into the upload system:
answers.txt
assignment_11.m
identify_villain.m
run_EM.m
get_Pxd.m
e_step.m
L_10.png
m_step.m
face10.png
face100.png
face500.png
d1.png
d10.png
Start by downloading the template of the assignment. The included datafile contains 500 images of dimensions 45×60 pixels. The face's width $w$ is fixed to be 36 pixels.
Initialization for the hints in the following text:
>> X = [158, 202, 0, 0, 231, 242, 255; 0, 149, 255, 116, 255, 212, 208; 237, 0, 255, 36, 202, 0, 222]
>> f = [163, 255; 156, 130; 255, 114]
>> w = 2
>> b = 154.0476
>> s2 = 0.8165
>> d = 3
>> m = 1
>> Pd = [0.0625, 0.1875, 0.0625, 0.2500, 0.3125, 0.1250]'
>> r = 2
Implement the function Pxd = get_Pxd(X,f,b,s2,d), which returns the log-likelihood $\log P_{X \mid D}(x \mid d;\theta)$ for displacement $d$ (this is the quantity that is later exponentiated when computing the marginal log-likelihood; see the section logsumexp trick). Example:
>> get_Pxd(X, f, b, s2, d)
ans = -1.4412e+05
Implement the function [f,d,b,L] = run_EM(X,m), which runs the EM loop described above.
Implement the functions e_step.m and m_step.m (cf. the E-step and M-step sections above). Example:
>> [Pdx, new_Pd] = e_step(X, f, b, s2, Pd, m)
Pdx =
    0.0000
         0
         0
         0
    1.0000
         0
new_Pd =
    0.0000
         0
         0
         0
    1.0000
         0
>> [f, b, s2] = m_step(X, f, b, s2, Pdx, 1)
f =
   231   242
   255   212
   202     0
b = 139.5333
s2 = 7.3557e+03
Seed the random number generator with rng(12) before calling run_EM(X,m) so that your results are reproducible.
Implement the function [optf,optd,optb,optL] = identify_villain(X,m,r).
Display the recovered face with imshow(uint8(f)).
Figure 4. The maximum a posteriori face position $d_1$ in the first image.
Recall that the marginal log-likelihood is $$ \log \Lambda(X;\theta) = \sum_k \log \sum_{d}P_{X \mid D}(x_k \mid d;\theta)P_{D}(d;\theta). $$ Notice that to calculate the marginal log-likelihood, you need to exponentiate the log-likelihoods returned by get_Pxd.m. Direct calculation will result in underflow because the log-likelihoods can be very large negative numbers, and exponentiating them yields zeros. This is where the logsumexp trick is needed. Suppose $$\alpha_d = \log P_{X \mid D}(x_k \mid d;\theta)+\log P_{D}(d;\theta)$$ and let $$d^* = \arg \max_d\{\alpha_d\}.$$ Then $$\log \sum_d e^{\alpha_d} = \log \left( e^{\alpha_{d^*}} \sum_d e^{\alpha_d-\alpha_{d^*}}\right) = \alpha_{d^*}+ \log\sum_d e^{\alpha_d-\alpha_{d^*}}.$$ Note what has been accomplished by this trick: the log-likelihoods have been translated so that the largest one is zero, which mitigates underflow during exponentiation. In addition, the biggest contributors to the likelihood, those values near $\alpha_{d^*}$, retain reasonable precision after exponentiation; they are near 1.
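In MATLAB the trick boils down to a few lines for a single image; here alpha is assumed to be the (W-w+1)-by-1 vector of $\alpha_d$ values defined above:

% alpha(d) = log P(x_k | d) + log Pd(d)
amax = max(alpha);
log_Px_k = amax + log(sum(exp(alpha - amax)));   % log sum_d exp(alpha(d)), computed stably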
In the E-step, you will estimate the posterior $$ P_{D \mid X}(d \mid x_k;\theta^{(t)}) = \frac{P_{X \mid D}(x_k \mid d;\theta^{(t)})P_{D}(d;\theta^{(t)})}{\sum_{d'} P_{X \mid D}(x_k \mid d';\theta^{(t)})P_D(d';\theta^{(t)})}. $$ The log-likelihoods will be large negative numbers, so if you compute $P_{D \mid X}(d \mid x_k;\theta^{(t)})$ naively, the computation will underflow when you exponentiate and sum. The following identity, closely related to the logsumexp trick, lets you estimate the posteriors accurately (with $\alpha_d$ defined as in the previous section): $$ \frac{e^{\alpha_d}}{\sum_{d'} e^{\alpha_{d'}}} = \frac{e^{\alpha_d-\max_{d'} \alpha_{d'}}}{\sum_{d'} e^{\alpha_{d'}-\max_{d'} \alpha_{d'}}}. $$
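A corresponding sketch for one column of the posterior table, again assuming alpha is the (W-w+1)-by-1 vector of $\alpha_d$ values for image $x_k$:

% Stabilized posterior for image k: one column of Pdx
e = exp(alpha - max(alpha));   % largest entry becomes exp(0) = 1
Pdx(:, k) = e / sum(e);        % column sums to one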
[1] EM notes of Hoffman (pdf)
[2] EM notes of Bishop (ppt)
[3] EM notes of Dellaert (pdf)
[4] EM notes of Tomasi with GMM example (pdf)
[5] Adéla ještě nevečeřela