Search
At the very beginning of the recognition labs, we assumed the class probability densities $p_{X|k}(x|k)$ and the a priori probabilities $p_K(k)$ to be known and we used them to find the optimal Bayesian strategy. Later, we looked at problems where the a priori probability is not known or does not exist and we constructed the optimal minimax strategy. Today, we face the problem of unknown probability density function $p_{X|k}(x|k)$ and we will be interested in its estimation.
In this lab we will assume that we know the density function is from a class of functions characterized by several parameters, and we will be dealing with estimating their optimal value. In particular, we will consider two cases:
To estimate the parameters we will use a training set - a set of examples drawn randomly from the particular distribution.
We will implement and evaluate three parameter estimation methods: Maximum likelihood (ML), Maximum a posteriori (MAP), and Bayesian inference.
The assignment has three parts. In the first one we will apply all the methods to the Normal distribution, in the second to the Categorical distribution and in the third part we will build a Bayes classifier using the estimated densities.
General information for Python development.
To fulfill this assignment, you need to submit these files (all packed in one .zip file) into the upload system:
.zip
mle_map_bayes.ipynb
mle_map_bayes.py
ml_estim_normal
ml_estim_categorical
map_estim_normal
map_estim_categorical
bayes_posterior_params_normal
bayes_posterior_params_categorical
bayes_estim_pdf_normal
bayes_estim_categorical
mle_Bayes_classif
map_Bayes_classif
bayes_Bayes_classif
mle_normal_derivation.jpg
map_normal_derivation.jpg
mle_categorical_derivation.jpg
map_categorical_derivation.jpg
mle_normal.png
mle_normal_varying_dataset_size.png
map_prior_normal.png
mle_map_prior_comparison_normal.png
mle_map_normal_dataset_sizes.png
bayes_posterior_normal.png
mle_map_bayes_normal.png
noise.png
categorical_data.png
mle_categorical.png
map_categorical.png
bayes_posterior_categorical.png
bayes_categorical.png
mle_map_bayes_Bayes_classifier.png
Use template of the assignment. When preparing a zip file for the upload system, do not include any directories, the files have to be in the zip file root.
bayes.py
At first, we will use synthetic data to examine the basic properties of different estimates. Start by generating some i.i.d. data $X=\{x_1, \ldots, x_N\}$ from a unimodal Normal distribution with $\mu=0$ and $\sigma^2=1$, i.e. $x_i \sim \mathcal{N}(\mu, \sigma^2)$. For most demonstrations $N=50$ data points will be enough. But feel free to experiment with larger datasets.
As you know from the lecture, our goal in MLE is to find a solution to this problem: $$\mu_*, \sigma_* = argmax_{\mu, \sigma} p(X| \mu, \sigma) = argmax_{\mu, \sigma} \prod_{i=1}^N p(x_i| \mu, \sigma)\;,$$ where the likelihood $p(x_i|\mu, \sigma)$ is the Normal distribution $\mathcal{N}(x_i|\mu, \sigma)$.
mu_mle, var_mle = ml_estim_normal(np.array([1, 4, 6, 3, 4, 7])) # -> 4.166666666666667, 3.805555555555556
plot_likelihood
plt.subplots
As you know from the lecture, in the MAP estimation the parameters are found by maximizing the a posteriori probability:
$$\theta_* = {\rm argmax}_{\theta} p(\theta | X) = {\rm argmax}_{\theta} p(X| \theta) p(\theta)$$
The advantage is that we can specify our prior knowledge about the true values of the parameters in the form of the prior $p(\theta)$ and thus avoid wrong uninformed estimates in case too few samples are available.
It was argued in the lecture that in order to obtain a close-form expression for a posterior probability, one should use a conjugate prior. In the lecture when we were estimating $\theta = \mu$ only and $\sigma^2$ was assumed to be known, a Normal distribution prior was used, which simplified the derivation significantly. In our current case, we assume $\theta = (\mu, \sigma^2)$, i.e. that both $\mu$ and $\sigma^2$ are unknown. Fortunately, even in this case there exists a conjugate prior. Out of several possible choices we will use the normal inverse gamma distribution:
$$p(\mu, \sigma^2) = \mbox{NIG}(\mu, \sigma^2 | \mu_0, \nu, \alpha, \beta) = \frac{\sqrt{\nu}}{\sigma \sqrt{2\pi}} \frac{\beta^\alpha}{\Gamma(\alpha)} \left(\frac{1}{\sigma^2}\right)^{\alpha+1} \exp\left[-\frac{2\beta + \nu(\mu_0 - \mu)^2}{2\sigma^2}\right]$$
where $\mu_0, \nu, \alpha$ and $\beta$ are the distribution parameters and $\Gamma(\cdot)$ is the Gamma function which is implemented in scipy.special.gamma.
scipy.special.gamma
A consequences of this choice, which we will explore in more detail in the next section on Bayesian inference, is that the a posteriori probability is again NIG, just with different parameters.
It is interesting to get the intuition behind the prior parameters. One can think of them as if before any data have been seen, the mean $\mu$ was estimated from $\nu$ observations with sample mean $\mu _{0}$ and the variance $\sigma^2$ was estimated from $2\alpha$ observations with sample mean $\mu _{0}$ and sum of squared deviations $2\beta$.
Alternatively, we may be interested in providing a prior values for $\mu$ and $\sigma^2$ (denoted $\mu_{my}, \sigma_{my}^2$) and then control the shape of the distribution around this peak. In this case, we set $\mu_0 = \mu_{my}$ and the parameters $\alpha$ and $\beta$ are linked by $\beta = \sigma_{my}^2 * (\alpha + 1.5)$, while $\nu$ stays a free parameter (this comes from the equations of the mode of the NIG distribution). So, using this parametrization we would specify e.g. $\mu_{my}, \sigma_{my}^2, \alpha$ and $nu$ instead of the original four parameters.
The prior pdf is implemented in norm_inv_gamma_pdf. Go and check that the implementation fits the formulas listed above.
norm_inv_gamma_pdf
Tasks:
plot_prior
mu_map, var_map = map_estim_normal(np.array([2, 7, 3, 5, 1, 0]), mu0=0, nu=5, alpha=1, beta=2.5) # -> (1.6363636363636365, 5.77685950413223)
In the Bayesian inference, we are interested in computing a posteriori distribution $p(\mu, \sigma^2 | X)$ over possible parameter values. We use the Bayes' rule:
\begin{align*} p(\mu, \sigma^2 | X, \mu_0, \nu, \alpha, \beta) &= \frac{\prod_{i=1}^N p(x_i|\mu, \sigma^2) p(\mu, \sigma^2)}{p(X)} \\ &=\frac{\prod_{i=1}^N \mathcal{N}(x_i|\mu, \sigma^2) \mbox{NIG}(\mu, \sigma^2 | \mu_0, \nu, \alpha, \beta)}{p(X)} \\ &= \frac{\kappa \; \mbox{NIG}(\mu, \sigma^2 | \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta})}{p(X)} \end{align*}
where the conjugate relationship between the likelihood and prior has been used and $\kappa$ is the associated normalization constant. The parameters of the new normal inverse gamma distribution can be shown to be
\begin{align*} \tilde{\alpha} = \alpha + N/2 \qquad \tilde{\nu} = \nu + N \qquad \tilde{\mu}_0 = \frac{\nu \mu_0 + \sum_i x_i}{\nu + N} \\ \tilde{\beta} = \beta + \frac{\sum_i x_i^2}{2} + \frac{\nu \mu_0^2}{2} - \frac{(\nu \mu_0 + \sum_i x_i)^2}{2(\nu + N)} \end{align*}
Notice further, that the a posteriori is a distribution, i.e. it integrates to one, so also the right hand side of the equation must be a distribution and thus the constant $\kappa$ and the denominator must exactly cancel out. So, thanks to the conjugacy of the prior, we again arrive at a close-form solution to the a posteriori distribution:
$$ p(\mu, \sigma^2 | X, \mu_0, \nu, \alpha, \beta) = \mbox{NIG}(\mu, \sigma^2 | \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta}) $$
Task:
plot_posterior_normal
Notice, that the graph looks exactly the same as for the MAP objective we created above. This is due to the normalisation of the plots in matplotlib. However, as notices above, the MAP objective is not a probability distribution as it does not integrate to one. On the other hand, the a posteriori $p(\mu, \sigma^2 | X, \mu_0, \nu, \alpha, \beta)$ is a correct distribution which integrates to one.
One interpretation of this graph is that, truly, the most likely solution is the MAP estimate, but there are also other possible solutions which only have smaller probability. What we want to do is to take them into account when making the prediction of the estimated distribution.
Predictive density
In the Bayesian inference we do not take only the most probable estimate of $\mu$ and $\sigma^2$ and use it for evaluation of $p(x|\mu, \sigma^2)$, but we compute a weighted average of the predictions at a test point $x^*$ for each possible parameter settings, where the weighting is given by the above posterior distribution.
\begin{align*} p(x^* | X, \mu_0, \nu, \alpha, \beta) &= \int \int p(x^* | \mu, \sigma^2) p(\mu, \sigma^2 | X, \mu_0, \nu, \alpha, \beta) d\mu d\sigma \\ &= \int \int \mathcal{N}(x^*|\mu, \sigma^2) \mbox{NIG}(\mu, \sigma^2 | \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta}) d\mu d\sigma \\ &= \int \int \kappa(x^*, \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta}) \mbox{NIG}(\mu, \sigma^2 | \breve{\mu}_0, \breve{\nu}, \breve{\alpha}, \breve{\beta}) d\mu d\sigma \end{align*}
where we used the conjugate relation for the second time (adopted from Prince book). We will call this distribution predictive density. Taking the constant out of the integrals we get
\begin{align*} p(x^* | X, \mu_0, \nu, \alpha, \beta) &= \kappa(x^*, \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta}) \int \int \mbox{NIG}(\mu, \sigma^2 | \breve{\mu}_0, \breve{\nu}, \breve{\alpha}, \breve{\beta}) d\mu d\sigma \\ &= \kappa(x^*, \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta}) \end{align*}
as the integral over the distribution is one and the constant $\kappa$ is given by
$$ \kappa(x^*, \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta}) = \frac{1}{\sqrt{2\pi}} \frac{\sqrt{\tilde{\nu}}\tilde{\beta}^{\tilde{\alpha}}}{\sqrt{\breve{\nu}}\breve{\beta}^{\breve{\alpha}}} \frac{\Gamma(\breve{\alpha})}{\Gamma(\tilde{\alpha})} $$
and
\begin{align*} \breve{\alpha} = \tilde{\alpha} + 1/2 \qquad \breve{\nu} = \tilde{\nu} + 1 \\ \breve{\beta} = \frac{{x^*}^2}{2} + \tilde{\beta} + \frac{\tilde{\nu}\tilde{\mu}_0^2}{2} - \frac{(\tilde{\nu}\tilde{\mu}_0 + x^*)^2}{2(\tilde{\nu} + 1)} \end{align*}
Next, lets look at estimating parameters of a categorical distribution. Suppose that our dataset $X=\{x_1, \ldots, x_N\}$ consists of points $x_i \in \{1, 2, \ldots, K\}$, like if we were throwing a biased dice or estimating preferences of $K$ presidential candidates. The categorical distribution is defined as
$$ p(x = k| p_1, \ldots, p_K) = p_k \qquad \mbox{where} \quad \sum_{k=1}^K p_k = 1$$
For the ML and MAP we estimate the parameters $\{p_k\}_{k=1}^K$, for the Bayesian approach we compute a probability distribution over the parameters.
As a prior needed for MAP estimation we choose the Dirichlet distribution as it is conjugate to the categorical likelihood
$$ \mbox{Dir}(p_1, \ldots, p_K; \alpha_1, \ldots, \alpha_K) = \frac{1}{B(\alpha)} \prod_{k=1}^K p_k^{\alpha_k - 1} $$
where $B(\alpha)$ is a normalization constant and $\alpha = (\alpha_1, \ldots, \alpha_K)$. It is implemented in numpy.random.dirichlet.
numpy.random.dirichlet
In the Bayesian approach we calculate a posteriori probability over the parameters (derivations adopted from Prince book)
\begin{align*} p(p_1, \ldots, p_6| X) &= \frac{\prod_{i=1}^N p(x_i|p_1, \ldots, p_6) p(p_1, \ldots, p_6)}{p(X)} \\ &= \frac{\prod_{i=1}^N \mbox{Cat}(x_i|p_1, \ldots, p_6) \mbox{Dir}(p_1, \ldots, p_6; \alpha_1, \ldots, \alpha_6)}{p(X)} \\ &= \frac{\kappa(\alpha_{1},\ldots, \alpha_{6}, X) \mbox{Dir}(p_1, \ldots, p_6; \tilde{\alpha}_1, \ldots, \tilde{\alpha}_6)}{p(X)} \\ &= \mbox{Dir}(p_1, \ldots, p_6; \tilde{\alpha}_1, \ldots, \tilde{\alpha}_6) \end{align*}
where $\tilde{\alpha}_k = N_k + \alpha_k$ and where we have once again exploited the conjugate relationship between the likelihood and prior. And as before, the normalization constant $\kappa$ cancels out with the denominator in order to obtain a valid probability distribution.
The predictive density used in ML and MAP was simply the categorical distribution with the estimated parameters $\hat{p}_k$. In the Bayesian case, we compute a weighted average of the predictions for each possible parameter set, where the weighting is given by the posterior distribution.
\begin{align*} p(x^*|X) &= \int p(x^*|p_1, \ldots, p_6) p(p_1, \ldots, p_6| X) dp_{1\ldots 6} \\ &= \int \mbox{Cat}(x^*|p_{1\ldots 6}) \mbox{Dir}(p_{1\ldots 6}; \tilde{\alpha}_{1\ldots 6}) dp_{1\ldots 6} \\ &= \int \kappa(x^*, \tilde{\alpha}_{1\ldots 6}) \mbox{Dir}(p_{1\ldots 6}; \breve{\alpha}_{1\ldots 6}) dp_{1\ldots 6} \\ &= \kappa(x^*, \tilde{\alpha}_{1\ldots 6}) \end{align*}
and it can be shown that
$$ p(x^*=k|X) = \kappa(x^*, \tilde{\alpha}_{1\ldots 6}) = \frac{N_k + \alpha_k}{\sum_{j=1}^6 (N_j + \alpha_j)} $$
In this part we will use the MLE, MAP and Bayesian estimates for the Normal distribution to build a Bayes classifier as in the previous assignment. Feel free to use your code from the previous lab to finish this task. However, upload every function needed so that we can evaluate it automatically.
The task is again to distinguish the images of letters 'A' from images of 'C' using the compute_measurement_lr_cont measurement.
compute_measurement_lr_cont
Spend some time specifying a meaningful prior for MAP and Bayesian estimates for both classes. The old values (0, 1) will no longer work here as the mean and variance of the data is different (in thousands)! It is recommended to plot the prior for a visual feedback for the parameters.
Simon J.D. Prince. Computer Vision: Models, Learning, and Inference. Cambridge University Press, 2012