At the very beginning of the recognition labs, we assumed the class probability densities $p_{X|k}(x|k)$ and the a priori probabilities $p_K(k)$ to be known and we used them to find the optimal Bayesian strategy. Later, we looked at problems where the a priori probability is not known or does not exist and we constructed the optimal minimax strategy. Today, we face the problem of unknown probability density function $p_{X|k}(x|k)$ and we will be interested in its estimation.
In this lab we will assume that the density function is known to belong to a class of functions characterized by several parameters, and we will be dealing with estimating their optimal values. In particular, we will consider two cases: the Normal distribution and the Categorical distribution.
To estimate the parameters we will use a training set - a set of examples drawn randomly from the particular distribution.
We will implement and evaluate three parameter estimation methods: Maximum likelihood (ML), Maximum a posteriori (MAP), and Bayesian inference.
The assignment has three parts. In the first one we will apply all the methods to the Normal distribution, in the second to the Categorical distribution and in the third part we will build a Bayes classifier using the estimated densities.
See the general information for Python development.
To fulfill this assignment, you need to submit these files (all packed in one .zip file) into the upload system:

- mle_map_bayes.ipynb - a notebook for data initialization, calling of the implemented functions and plotting of their results (for your convenience, will not be checked)
- mle_map_bayes.py - a file with the following methods implemented:
  - ml_estim_normal - computes the ML estimate of the Normal distribution parameters
  - ml_estim_categorical - computes the ML estimate of the Categorical distribution parameters
  - map_estim_normal - computes the MAP estimate of the Normal distribution parameters
  - map_estim_categorical - computes the MAP estimate of the Categorical distribution parameters
  - bayes_posterior_params_normal - computes the a posteriori distribution parameters using Bayesian inference for the Normal distribution
  - bayes_posterior_params_categorical - computes the a posteriori distribution parameters using Bayesian inference for the Categorical distribution
  - bayes_estim_pdf_normal - computes the Bayesian estimate for the Normal distribution as a NIG probability density function
  - bayes_estim_categorical - computes the Bayesian estimate of the Categorical distribution parameters
  - mle_Bayes_classif - classifies using the ML estimate of the normal distributions
  - map_Bayes_classif - classifies using the MAP estimate of the normal distributions
  - bayes_Bayes_classif - classifies using the Bayesian estimate of the normal distributions
- mle_normal_derivation.jpg, map_normal_derivation.jpg, mle_categorical_derivation.jpg and map_categorical_derivation.jpg - images of the handwritten derivations
- mle_normal.png, mle_normal_varying_dataset_size.png, map_prior_normal.png, mle_map_prior_comparison_normal.png, mle_map_normal_dataset_sizes.png, bayes_posterior_normal.png, mle_map_bayes_normal.png, noise.png, categorical_data.png, mle_categorical.png, map_categorical.png, bayes_posterior_categorical.png, bayes_categorical.png and mle_map_bayes_Bayes_classifier.png - images specified in the assignment
Use the template of the assignment. When preparing the .zip file for the upload system, do not include any directories; the files have to be in the root of the .zip file.
If your solution relies on functions from bayes.py, note that this file is not submitted to BRUTE; please copy all the necessary functions directly into mle_map_bayes.py. Sorry for the inconvenience.
At first, we will use synthetic data to examine the basic properties of different estimates. Start by generating some i.i.d. data $X=\{x_1, \ldots, x_N\}$ from a unimodal Normal distribution with $\mu=0$ and $\sigma^2=1$, i.e. $x_i \sim \mathcal{N}(\mu, \sigma^2)$. For most demonstrations $N=50$ data points will be enough. But feel free to experiment with larger datasets.
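For example, such a dataset can be drawn with NumPy (a minimal sketch; the seed and variable names are only illustrative):

import numpy as np

np.random.seed(0)                 # fixed seed only to make the plots reproducible
mu_true, var_true = 0.0, 1.0      # the true parameters used throughout this section
N = 50
x = np.random.normal(mu_true, np.sqrt(var_true), size=N)   # x_i ~ N(mu, sigma^2)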
As you know from the lecture, our goal in MLE is to find a solution to this problem: $$\mu_*, \sigma_* = {\rm argmax}_{\mu, \sigma}\; p(X| \mu, \sigma) = {\rm argmax}_{\mu, \sigma} \prod_{i=1}^N p(x_i| \mu, \sigma)\;,$$ where the likelihood $p(x_i|\mu, \sigma)$ is the Normal distribution $\mathcal{N}(x_i|\mu, \sigma)$.
Tasks:

1. Derive the maximum likelihood estimate of the Normal distribution parameters and implement it in ml_estim_normal. Save a photo of the handwritten derivation as mle_normal_derivation.jpg (this is a good chance to see how the log-likelihood can be useful ;) ).
   mu_mle, var_mle = ml_estim_normal(np.array([1, 4, 6, 3, 4, 7])) # -> 4.166666666666667, 3.805555555555556
2. Use plot_likelihood to plot the likelihood for a range of $\mu$ and $\sigma^2$ values. Mark the true and estimated values. Next, plot the true and estimated distributions in one graph. Save the figure as mle_normal.png.
3. Repeat the estimation for varying dataset sizes (plt.subplots may come in handy for arranging the plots) and save the resulting figure as mle_normal_varying_dataset_size.png.
As you know from the lecture, in the MAP estimation the parameters are found by maximizing the a posteriori probability:
$$\theta_* = {\rm argmax}_{\theta} p(\theta | X) = {\rm argmax}_{\theta} p(X| \theta) p(\theta)$$
The advantage is that we can specify our prior knowledge about the true values of the parameters in the form of the prior $p(\theta)$ and thus avoid poorly informed estimates when only a few samples are available.
It was argued in the lecture that in order to obtain a closed-form expression for the posterior probability, one should use a conjugate prior. In the lecture, when we were estimating $\theta = \mu$ only and $\sigma^2$ was assumed to be known, a Normal distribution prior was used, which simplified the derivation significantly. In our current case, we assume $\theta = (\mu, \sigma^2)$, i.e. that both $\mu$ and $\sigma^2$ are unknown. Fortunately, even in this case there exists a conjugate prior. Out of several possible choices we will use the normal inverse gamma (NIG) distribution:
$$p(\mu, \sigma^2) = \mbox{NIG}(\mu, \sigma^2 | \mu_0, \nu, \alpha, \beta) = \frac{\sqrt{\nu}}{\sigma \sqrt{2\pi}} \frac{\beta^\alpha}{\Gamma(\alpha)} \left(\frac{1}{\sigma^2}\right)^{\alpha+1} \exp\left[-\frac{2\beta + \nu(\mu_0 - \mu)^2}{2\sigma^2}\right]$$
where $\mu_0, \nu, \alpha$ and $\beta$ are the distribution parameters and $\Gamma(\cdot)$ is the Gamma function, which is implemented in scipy.special.gamma.
A consequence of this choice, which we will explore in more detail in the next section on Bayesian inference, is that the a posteriori probability is again NIG, just with different parameters.
It is interesting to get the intuition behind the prior parameters. One can think of them as if before any data have been seen, the mean $\mu$ was estimated from $\nu$ observations with sample mean $\mu _{0}$ and the variance $\sigma^2$ was estimated from $2\alpha$ observations with sample mean $\mu _{0}$ and sum of squared deviations $2\beta$.
Alternatively, we may be interested in providing prior values for $\mu$ and $\sigma^2$ (denoted $\mu_{my}, \sigma_{my}^2$) and then controlling the shape of the distribution around this peak. In this case, we set $\mu_0 = \mu_{my}$ and the parameters $\alpha$ and $\beta$ are linked by $\beta = \sigma_{my}^2 (\alpha + 1.5)$, while $\nu$ stays a free parameter (this follows from the equations for the mode of the NIG distribution). So, using this parametrization we would specify e.g. $\mu_{my}, \sigma_{my}^2, \alpha$ and $\nu$ instead of the original four parameters.
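To illustrate this parametrization, a hypothetical helper (not part of the template) could convert $(\mu_{my}, \sigma_{my}^2, \alpha, \nu)$ into the four NIG parameters:

def nig_params_from_mode(mu_my, var_my, alpha, nu):
    """Hypothetical helper: choose NIG parameters whose mode lies at (mu_my, var_my).

    alpha and nu remain free parameters controlling the spread of the prior."""
    mu0 = mu_my                      # the prior mode in mu is at mu_0
    beta = var_my * (alpha + 1.5)    # mode of the NIG in sigma^2 is beta / (alpha + 3/2)
    return mu0, nu, alpha, beta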
The prior pdf is implemented in norm_inv_gamma_pdf. Go and check that the implementation fits the formula listed above.
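For reference, the formula above transcribes roughly as follows (a sketch only, assuming the argument order (mu, var, mu0, nu, alpha, beta); the template's norm_inv_gamma_pdf is authoritative):

import numpy as np
from scipy.special import gamma as gamma_fn

def nig_pdf_sketch(mu, var, mu0, nu, alpha, beta):
    """Sketch of NIG(mu, sigma^2 | mu0, nu, alpha, beta) following the formula above."""
    sigma = np.sqrt(var)
    norm = np.sqrt(nu) / (sigma * np.sqrt(2 * np.pi)) * beta**alpha / gamma_fn(alpha)
    return norm * (1.0 / var)**(alpha + 1) * np.exp(-(2 * beta + nu * (mu0 - mu)**2) / (2 * var))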
Tasks:
1. Plot the prior using the plot_prior function together with its mode. Save the figure as map_prior_normal.png.
2. Derive the MAP estimate of the Normal distribution parameters with the NIG prior and save a photo of the handwritten derivation as map_normal_derivation.jpg.
3. Implement the map_estim_normal function which returns the MAP estimate of the parameters given the data and the prior probability parameters.
   mu_map, var_map = map_estim_normal(np.array([2, 7, 3, 5, 1, 0]), mu0=0, nu=5, alpha=1, beta=2.5) # -> (1.6363636363636365, 5.77685950413223)
4. Plot the ML and MAP estimates together with the prior and save the figure as mle_map_prior_comparison_normal.png.
5. Repeat the comparison for varying dataset sizes and save the figure as mle_map_normal_dataset_sizes.png.
In Bayesian inference, we are interested in computing the a posteriori distribution $p(\mu, \sigma^2 | X)$ over possible parameter values. We use Bayes' rule:
\begin{align*} p(\mu, \sigma^2 | X, \mu_0, \nu, \alpha, \beta) &= \frac{\prod_{i=1}^N p(x_i|\mu, \sigma^2) p(\mu, \sigma^2)}{p(X)} \\ &=\frac{\prod_{i=1}^N \mathcal{N}(x_i|\mu, \sigma^2) \mbox{NIG}(\mu, \sigma^2 | \mu_0, \nu, \alpha, \beta)}{p(X)} \\ &= \frac{\kappa \; \mbox{NIG}(\mu, \sigma^2 | \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta})}{p(X)} \end{align*}
where the conjugate relationship between the likelihood and prior has been used and $\kappa$ is the associated normalization constant. The parameters of the new normal inverse gamma distribution can be shown to be
\begin{align*} \tilde{\alpha} = \alpha + N/2 \qquad \tilde{\nu} = \nu + N \qquad \tilde{\mu}_0 = \frac{\nu \mu_0 + \sum_i x_i}{\nu + N} \\ \tilde{\beta} = \beta + \frac{\sum_i x_i^2}{2} + \frac{\nu \mu_0^2}{2} - \frac{(\nu \mu_0 + \sum_i x_i)^2}{2(\nu + N)} \end{align*}
Notice further that the a posteriori probability is a distribution, i.e. it integrates to one, so the right-hand side of the equation must also be a distribution, and thus the constant $\kappa$ and the denominator must exactly cancel out. So, thanks to the conjugacy of the prior, we again arrive at a closed-form solution for the a posteriori distribution:
$$ p(\mu, \sigma^2 | X, \mu_0, \nu, \alpha, \beta) = \mbox{NIG}(\mu, \sigma^2 | \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta}) $$
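A minimal sketch of how bayes_posterior_params_normal could apply these update equations (the exact signature and argument order in the template may differ):

import numpy as np

def bayes_posterior_params_normal_sketch(x, mu0, nu, alpha, beta):
    """Sketch: NIG posterior parameters given data x and a NIG prior (mu0, nu, alpha, beta)."""
    x = np.asarray(x, dtype=float)
    N = x.size
    alpha_post = alpha + N / 2.0
    nu_post = nu + N
    mu0_post = (nu * mu0 + x.sum()) / (nu + N)
    beta_post = (beta + 0.5 * np.sum(x**2) + 0.5 * nu * mu0**2
                 - (nu * mu0 + x.sum())**2 / (2.0 * (nu + N)))
    return mu0_post, nu_post, alpha_post, beta_post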
Task:
1. Implement bayes_posterior_params_normal and use the function plot_posterior_normal to plot the a posteriori probability over $\mu$ and $\sigma^2$ values. Mark the MLE, MAP and maximum prior solutions. Save the figure as bayes_posterior_normal.png.
Notice that the graph looks exactly the same as the one for the MAP objective we created above. This is due to the normalization of the plots in matplotlib. However, as noted above, the MAP objective is not a probability distribution, as it does not integrate to one. On the other hand, the a posteriori $p(\mu, \sigma^2 | X, \mu_0, \nu, \alpha, \beta)$ is a proper distribution which integrates to one.
One interpretation of this graph is that the MAP estimate is indeed the most probable solution, but other parameter values are also possible, only with smaller probability. What we want to do is to take them into account when making the prediction with the estimated distribution.
Predictive density
In Bayesian inference we do not take only the most probable estimate of $\mu$ and $\sigma^2$ and use it to evaluate $p(x|\mu, \sigma^2)$; instead, we compute a weighted average of the predictions at a test point $x^*$ over all possible parameter settings, where the weighting is given by the above posterior distribution.
\begin{align*} p(x^* | X, \mu_0, \nu, \alpha, \beta) &= \int \int p(x^* | \mu, \sigma^2) p(\mu, \sigma^2 | X, \mu_0, \nu, \alpha, \beta) d\mu d\sigma \\ &= \int \int \mathcal{N}(x^*|\mu, \sigma^2) \mbox{NIG}(\mu, \sigma^2 | \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta}) d\mu d\sigma \\ &= \int \int \kappa(x^*, \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta}) \mbox{NIG}(\mu, \sigma^2 | \breve{\mu}_0, \breve{\nu}, \breve{\alpha}, \breve{\beta}) d\mu d\sigma \end{align*}
where we used the conjugate relation for the second time (adapted from the Prince book). We will call this distribution the predictive density. Taking the constant out of the integrals we get
\begin{align*} p(x^* | X, \mu_0, \nu, \alpha, \beta) &= \kappa(x^*, \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta}) \int \int \mbox{NIG}(\mu, \sigma^2 | \breve{\mu}_0, \breve{\nu}, \breve{\alpha}, \breve{\beta}) d\mu d\sigma \\ &= \kappa(x^*, \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta}) \end{align*}
as the integral over the distribution is one and the constant $\kappa$ is given by
$$ \kappa(x^*, \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta}) = \frac{1}{\sqrt{2\pi}} \frac{\sqrt{\tilde{\nu}}\tilde{\beta}^{\tilde{\alpha}}}{\sqrt{\breve{\nu}}\breve{\beta}^{\breve{\alpha}}} \frac{\Gamma(\breve{\alpha})}{\Gamma(\tilde{\alpha})} $$
and
\begin{align*} \breve{\alpha} = \tilde{\alpha} + 1/2 \qquad \breve{\nu} = \tilde{\nu} + 1 \\ \breve{\beta} = \frac{{x^*}^2}{2} + \tilde{\beta} + \frac{\tilde{\nu}\tilde{\mu}_0^2}{2} - \frac{(\tilde{\nu}\tilde{\mu}_0 + x^*)^2}{2(\tilde{\nu} + 1)} \end{align*}
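Putting the last three equations together, the predictive density at a test point $x^*$ can be evaluated roughly as follows (a sketch assuming the posterior parameters are passed in explicitly; for large $\tilde{\alpha}$, scipy.special.gammaln is numerically safer than evaluating the Gamma function directly):

import numpy as np
from scipy.special import gamma as gamma_fn

def predictive_pdf_sketch(x_star, mu0_t, nu_t, alpha_t, beta_t):
    """Sketch of p(x* | X, ...) = kappa(x*, posterior params) from the formulas above."""
    alpha_b = alpha_t + 0.5
    nu_b = nu_t + 1.0
    beta_b = (0.5 * x_star**2 + beta_t + 0.5 * nu_t * mu0_t**2
              - (nu_t * mu0_t + x_star)**2 / (2.0 * (nu_t + 1.0)))
    return (1.0 / np.sqrt(2 * np.pi)
            * np.sqrt(nu_t) * beta_t**alpha_t / (np.sqrt(nu_b) * beta_b**alpha_b)
            * gamma_fn(alpha_b) / gamma_fn(alpha_t))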
Tasks:
1. Implement bayes_estim_pdf_normal using the above equations.
2. Plot the ML, MAP and Bayesian estimates of the distribution together with the true distribution and save the figure as mle_map_bayes_normal.png.
3. Experiment further as described in the assignment and save the resulting figure as noise.png.
Next, let's look at estimating the parameters of a Categorical distribution. Suppose that our dataset $X=\{x_1, \ldots, x_N\}$ consists of points $x_i \in \{1, 2, \ldots, K\}$, as if we were throwing a biased die or estimating the preferences for $K$ presidential candidates. The Categorical distribution is defined as
$$ p(x = k| p_1, \ldots, p_K) = p_k \qquad \mbox{where} \quad \sum_{k=1}^K p_k = 1$$
For the ML and MAP approaches we estimate the parameters $\{p_k\}_{k=1}^K$; in the Bayesian approach we compute a probability distribution over the parameters.
Tasks:
1. Generate a dataset from a Categorical distribution, plot it and save the figure as categorical_data.png (one possible way of generating such data is sketched below).
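One possible way of generating such data (the probabilities below are only an example of a biased six-sided die):

import numpy as np

p_true = np.array([0.1, 0.1, 0.15, 0.2, 0.15, 0.3])        # example biased die
x = np.random.choice(np.arange(1, 7), size=200, p=p_true)  # observations x_i in {1, ..., 6}
counts = np.bincount(x, minlength=7)[1:]                    # N_k, the number of occurrences of each value k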
Tasks:
1. Derive the ML estimate of the Categorical distribution parameters and save a photo of the handwritten derivation as mle_categorical_derivation.jpg.
2. Implement ml_estim_categorical and use it to estimate the distribution parameters. Plot the estimate and save the figure as mle_categorical.png.
As the prior needed for the MAP estimation we choose the Dirichlet distribution, as it is conjugate to the categorical likelihood:
$$ \mbox{Dir}(p_1, \ldots, p_K; \alpha_1, \ldots, \alpha_K) = \frac{1}{B(\alpha)} \prod_{k=1}^K p_k^{\alpha_k - 1} $$
where $B(\alpha)$ is a normalization constant and $\alpha = (\alpha_1, \ldots, \alpha_K)$. Sampling from this distribution is implemented in numpy.random.dirichlet.
Tasks:
1. Derive the MAP estimate of the Categorical distribution parameters with the Dirichlet prior and save a photo of the handwritten derivation as map_categorical_derivation.jpg.
2. Implement map_estim_categorical and use it to estimate the distribution parameters. Plot the estimate and save the figure as map_categorical.png.
In the Bayesian approach we calculate the a posteriori probability over the parameters (derivation adapted from the Prince book):
\begin{align*} p(p_1, \ldots, p_6| X) &= \frac{\prod_{i=1}^N p(x_i|p_1, \ldots, p_6) p(p_1, \ldots, p_6)}{p(X)} \\ &= \frac{\prod_{i=1}^N \mbox{Cat}(x_i|p_1, \ldots, p_6) \mbox{Dir}(p_1, \ldots, p_6; \alpha_1, \ldots, \alpha_6)}{p(X)} \\ &= \frac{\kappa(\alpha_{1},\ldots, \alpha_{6}, X) \mbox{Dir}(p_1, \ldots, p_6; \tilde{\alpha}_1, \ldots, \tilde{\alpha}_6)}{p(X)} \\ &= \mbox{Dir}(p_1, \ldots, p_6; \tilde{\alpha}_1, \ldots, \tilde{\alpha}_6) \end{align*}
where $N_k$ is the number of observations with $x_i = k$, $\tilde{\alpha}_k = N_k + \alpha_k$, and we have once again exploited the conjugate relationship between the likelihood and the prior. As before, the normalization constant $\kappa$ cancels out with the denominator, so that we obtain a valid probability distribution.
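Since the update only adds the observed counts to the prior parameters, a sketch of bayes_posterior_params_categorical can be very short (assuming it receives the per-class counts $N_k$ and the prior $\alpha_k$):

import numpy as np

def bayes_posterior_params_categorical_sketch(counts, alphas):
    """Sketch: posterior Dirichlet parameters, alpha_k_post = N_k + alpha_k."""
    return np.asarray(counts, dtype=float) + np.asarray(alphas, dtype=float)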
The predictive density used in ML and MAP was simply the categorical distribution with the estimated parameters $\hat{p}_k$. In the Bayesian case, we compute a weighted average of the predictions for each possible parameter set, where the weighting is given by the posterior distribution.
\begin{align*} p(x^*|X) &= \int p(x^*|p_1, \ldots, p_6) p(p_1, \ldots, p_6| X) dp_{1\ldots 6} \\ &= \int \mbox{Cat}(x^*|p_{1\ldots 6}) \mbox{Dir}(p_{1\ldots 6}; \tilde{\alpha}_{1\ldots 6}) dp_{1\ldots 6} \\ &= \int \kappa(x^*, \tilde{\alpha}_{1\ldots 6}) \mbox{Dir}(p_{1\ldots 6}; \breve{\alpha}_{1\ldots 6}) dp_{1\ldots 6} \\ &= \kappa(x^*, \tilde{\alpha}_{1\ldots 6}) \end{align*}
and it can be shown that
$$ p(x^*=k|X) = \kappa(x^*, \tilde{\alpha}_{1\ldots 6}) = \frac{N_k + \alpha_k}{\sum_{j=1}^6 (N_j + \alpha_j)} $$
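The Bayesian estimate then follows directly from this formula; a sketch of bayes_estim_categorical under the same input assumptions (per-class counts and prior $\alpha_k$):

import numpy as np

def bayes_estim_categorical_sketch(counts, alphas):
    """Sketch: p(x* = k | X) = (N_k + alpha_k) / sum_j (N_j + alpha_j)."""
    post = np.asarray(counts, dtype=float) + np.asarray(alphas, dtype=float)
    return post / post.sum()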
Tasks:
1. Implement the bayes_posterior_params_categorical function. Plot the a posteriori distribution and save the figure as bayes_posterior_categorical.png.
2. Implement bayes_estim_categorical which returns the Bayesian estimate. Plot the estimate and save the figure as bayes_categorical.png.

In this part we will use the MLE, MAP and Bayesian estimates for the Normal distribution to build a Bayes classifier as in the previous assignment. Feel free to use your code from the previous lab to finish this task. However, upload every function needed so that we can evaluate it automatically.
The task is again to distinguish images of the letter 'A' from images of 'C' using the compute_measurement_lr_cont measurement.
Spend some time specifying a meaningful prior for the MAP and Bayesian estimates for both classes. The old values (0, 1) will no longer work here, as the mean and variance of the data are different (in the thousands)! It is recommended to plot the prior to get visual feedback on the parameter values.
Tasks:
1. Implement mle_Bayes_classif, map_Bayes_classif and bayes_Bayes_classif. Assume a 0/1 loss function. The first two use the strategy from the Bayes classification labs, whereas for the Bayesian estimate the pdf is no longer a Normal distribution, so we have to use the inequality $p(A)p(x|A) > p(C)p(x|C)$ to decide for a class (a minimal sketch of this decision step is shown below). Compute the a priori probabilities as relative frequencies in the training dataset.
2. Compare the three classifiers and save the figure as mle_map_bayes_Bayes_classifier.png.
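A minimal sketch of the Bayesian decision step, assuming the per-class predictive densities of the test measurements have already been evaluated (the function names and signatures are illustrative, not those of the template):

import numpy as np

def priors_from_labels(labels):
    """Sketch: a priori probabilities as relative frequencies in the training set."""
    labels = np.asarray(labels)
    return np.mean(labels == 'A'), np.mean(labels == 'C')

def bayes_decision_sketch(px_A, px_C, p_A, p_C):
    """Sketch: decide 'A' wherever p(A) p(x|A) > p(C) p(x|C), otherwise 'C'."""
    return np.where(p_A * px_A > p_C * px_C, 'A', 'C')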
References:

Simon J.D. Prince. Computer Vision: Models, Learning, and Inference. Cambridge University Press, 2012.