====== Maximum likelihood, Maximum a posteriori and Bayesian inference estimates of distribution parameters ======

At the very beginning of the recognition labs, we assumed the class probability densities $p_{X|k}(x|k)$ and the a priori probabilities $p_K(k)$ to be known and we used them to find the optimal {{courses:be5b33rpz:labs:02_bayes:|Bayesian strategy}}. Later, we looked at problems where the a priori probability is not known or does not exist and we constructed the optimal {{courses:be5b33rpz:labs:03_minimax:|minimax strategy}}. Today, we face the problem of an unknown probability density function $p_{X|k}(x|k)$ and we will be interested in its estimation. In this lab we will assume that the density function comes from a class of functions characterized by several parameters, and we will be dealing with estimating their optimal values. In particular, we will consider two cases:
  * the density is modeled by a [[https://en.wikipedia.org/wiki/Normal_distribution|Normal distribution]] -- the unknown parameters will be the mean and variance of the distribution
  * the density is modeled as a [[https://en.wikipedia.org/wiki/Categorical_distribution|Categorical distribution]] -- the unknown parameters will be the probabilities of the individual classes/categories

To estimate the parameters we will use a training set - a set of examples drawn randomly from the particular distribution. We will implement and evaluate three parameter estimation methods: Maximum likelihood (ML), Maximum a posteriori (MAP), and Bayesian inference.

The assignment has three parts. In the first one we apply all the methods to the Normal distribution, in the second to the Categorical distribution, and in the third part we build a Bayes classifier using the estimated densities.

[[https://cw.fel.cvut.cz/wiki/courses/be5b33rpz/labs/python_development|General information for Python development]].

To fulfill this assignment, you need to submit these files (all packed in one ''.zip'' file) into the [[https://cw.felk.cvut.cz/brute/|upload system]]:
  * **''mle_map_bayes.ipynb''** - a notebook for data initialization, calling the implemented functions and plotting their results (for your convenience, will not be checked).
  * **''mle_map_bayes.py''** - file with the following methods implemented:
    * **''ml_estim_normal''** - function which computes the ML estimate of Normal distribution parameters
    * **''ml_estim_categorical''** - function which computes the ML estimate of Categorical distribution parameters
    * **''map_estim_normal''** - function which computes the MAP estimate of Normal distribution parameters
    * **''map_estim_categorical''** - function which computes the MAP estimate of Categorical distribution parameters
    * **''bayes_posterior_params_normal''** - function which computes the a posteriori distribution parameters using Bayesian inference for the Normal distribution
    * **''bayes_posterior_params_categorical''** - function which computes the a posteriori distribution parameters using Bayesian inference for the Categorical distribution
    * **''bayes_estim_pdf_normal''** - function which computes the Bayesian estimate for the Normal distribution as a NIG probability density function
    * **''bayes_estim_categorical''** - function which computes the Bayesian estimate of Categorical distribution parameters
    * **''mle_Bayes_classif''** - function which classifies using the ML estimate of normal distributions
    * **''map_Bayes_classif''** - function which classifies using the MAP estimate of normal distributions
    * **''bayes_Bayes_classif''** - function which classifies using the Bayes estimate of normal distributions
  * **''map_normal_derivation.jpg''** and **''map_categorical_derivation.jpg''** - images of handwritten derivations
  * **''mle_normal.png''**, **''mle_normal_varying_dataset_size.png''**, **''map_prior_normal.png''**, **''mle_map_prior_comparison_normal.png''**, **''mle_map_normal_dataset_sizes.png''**, **''bayes_posterior_normal.png''**, **''mle_map_bayes_normal.png''**, **''noise.png''**, **''categorical_data.png''**, **''mle_categorical.png''**, **''map_categorical.png''**, **''bayes_posterior_categorical.png''**, **''bayes_categorical.png''** and **''mle_map_bayes_Bayes_classifier.png''** - images specified in the assignment

**Use the [[https://cw.fel.cvut.cz/wiki/courses/be5b33rpz/labs/python_development#Assignment Templates|template]] of the assignment.** When preparing a zip file for the upload system, **do not include any directories** - the files have to be in the zip file root.

It is not possible to import your previously implemented functions from ''bayes.py'' in BRUTE, so please copy all the necessary functions directly into ''mle_map_bayes.py''. Sorry for the inconvenience.

You will gain a maximum of 4 points by uploading your solution (including the handwritten derivations) to the upload system. You will be awarded the remaining 4 points if your handwritten notes are correct (we will inspect them manually). We will always make a note in BRUTE when your solution is checked. If you need a new attempt, please contact your lab teaching assistant directly.

===== Part 1: Normal distribution =====

First, we will use synthetic data to examine the basic properties of the different estimates. Start by generating some i.i.d. data $X=\{x_1, \ldots, x_N\}$ from a unimodal Normal distribution with $\mu=0$ and $\sigma^2=1$, i.e. $x_i \sim \mathcal{N}(\mu, \sigma^2)$. For most demonstrations $N=50$ data points will be enough, but feel free to experiment with larger datasets.
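A minimal sketch of generating such a dataset (the variable names and the fixed seed are our own choices, not part of the template):
<code python>
import numpy as np

# draw N = 50 i.i.d. samples from N(mu = 0, sigma^2 = 1)
rng = np.random.default_rng(seed=0)          # a fixed seed only makes the plots reproducible
x = rng.normal(loc=0.0, scale=1.0, size=50)  # shape (50,)
</code>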
==== Maximum Likelihood Estimation ====

As you know from the {{courses:be5b33rpz:lectures:pr_03_parameter_estimation_2016_10_17.pdf|lecture}}, our goal in MLE is to find a solution to this problem: $$\mu_*, \sigma_* = {\rm argmax}_{\mu, \sigma} p(X| \mu, \sigma) = {\rm argmax}_{\mu, \sigma} \prod_{i=1}^N p(x_i| \mu, \sigma)\;,$$ where the likelihood $p(x_i|\mu, \sigma)$ is the Normal distribution $\mathcal{N}(x_i|\mu, \sigma)$.

  - Start by deriving the maximum likelihood estimates of the Normal distribution parameters $\mu$ and $\sigma^2$ on a piece of paper - we won't grade it, but it is useful training. \\ \\
  - Use the derived formulas to complete the function **''ml_estim_normal''**. <code python>
mu_mle, var_mle = ml_estim_normal(np.array([1, 4, 6, 3, 4, 7]))
# -> 4.166666666666667, 3.805555555555556
</code> \\
  - Use the function **''plot_likelihood''** to plot the likelihood for a range of $\mu$ and $\sigma^2$ values. Mark the true and estimated values. Next, plot the true and estimated distributions in one graph.{{ :courses:be5b33rpz:labs:04_mle_map_bayes:mle_normal.png?nolink |}} Save the figure as **''mle_normal.png''**. \\ **Hint:** Use **''plt.subplots''**. \\ \\
  - Try to generate the data repeatedly while re-drawing these plots. Observe the variations in the estimate. After seeing this, how much would you trust an ML estimate? Did you know that MLE is probably the most frequently used parameter estimation technique? \\ \\
  - A common way to avoid imprecise ML estimates is to increase the number of data points. Try to plot the true and estimated distributions with an increasing number of samples.{{ :courses:be5b33rpz:labs:04_mle_map_bayes:mle_normal_varying_dataset_size.png?nolink |}} Save the figure as **''mle_normal_varying_dataset_size.png''**.

==== MAP Estimation ====

As you know from the {{courses:be5b33rpz:lectures:pr_03_parameter_estimation_2016_10_17.pdf|lecture}}, in MAP estimation the parameters are found by maximizing the a posteriori probability: $$\theta_* = {\rm argmax}_{\theta} p(\theta | X) = {\rm argmax}_{\theta} p(X| \theta) p(\theta)$$ The advantage is that we can specify our prior knowledge about the true values of the parameters in the form of the prior $p(\theta)$ and thus avoid wrong uninformed estimates when too few samples are available.

It was argued in the lecture that in order to obtain a closed-form expression for the posterior probability, one should use a [[https://en.wikipedia.org/wiki/Conjugate_prior|conjugate prior]]. In the lecture, where only $\theta = \mu$ was estimated and $\sigma^2$ was assumed to be known, a Normal distribution prior was used, which simplified the derivation significantly. In our current case, we assume $\theta = (\mu, \sigma^2)$, i.e. that both $\mu$ and $\sigma^2$ are unknown. Fortunately, even in this case there exists a conjugate prior. Out of several [[https://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf|possible choices]] we will use the [[https://en.wikipedia.org/wiki/Normal-inverse-gamma_distribution|normal inverse gamma distribution]]: $$p(\mu, \sigma^2) = \mbox{NIG}(\mu, \sigma^2 | \mu_0, \nu, \alpha, \beta) = \frac{\sqrt{\nu}}{\sigma \sqrt{2\pi}} \frac{\beta^\alpha}{\Gamma(\alpha)} \left(\frac{1}{\sigma^2}\right)^{\alpha+1} \exp\left[-\frac{2\beta + \nu(\mu_0 - \mu)^2}{2\sigma^2}\right]$$ where $\mu_0, \nu, \alpha$ and $\beta$ are the distribution parameters and $\Gamma(\cdot)$ is the [[https://en.wikipedia.org/wiki/Gamma_function|Gamma function]], which is implemented in **''scipy.special.gamma''**.
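The formula translates directly into code. Purely as a sanity check, a minimal sketch of a term-by-term transcription could look like this (the name ''nig_pdf'' is ours; the lab template provides its own implementation, introduced below):
<code python>
import numpy as np
from scipy.special import gamma as gamma_fn

def nig_pdf(mu, var, mu0, nu, alpha, beta):
    # normal inverse gamma density NIG(mu, sigma^2 | mu0, nu, alpha, beta),
    # transcribed from the formula above (var stands for sigma^2)
    sigma = np.sqrt(var)
    return (np.sqrt(nu) / (sigma * np.sqrt(2.0 * np.pi))
            * beta**alpha / gamma_fn(alpha)
            * (1.0 / var)**(alpha + 1)
            * np.exp(-(2.0 * beta + nu * (mu0 - mu)**2) / (2.0 * var)))
</code>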
A consequence of this choice, which we will explore in more detail in the next section on Bayesian inference, is that the a posteriori probability is again NIG, just with different parameters.

It is interesting to get some intuition for the prior parameters. One can think of them as if, before any data had been seen, the mean $\mu$ had been estimated from $\nu$ observations with sample mean $\mu _{0}$ and the variance $\sigma^2$ had been estimated from $2\alpha$ observations with sample mean $\mu _{0}$ and a sum of [[https://en.wikipedia.org/wiki/Squared_deviations|squared deviations]] of $2\beta$.

Alternatively, we may be interested in providing prior values for $\mu$ and $\sigma^2$ (denoted $\mu_{my}, \sigma_{my}^2$) and then controlling the shape of the distribution around this peak. In this case, we set $\mu_0 = \mu_{my}$ and the parameters $\alpha$ and $\beta$ are linked by $\beta = \sigma_{my}^2 (\alpha + 1.5)$, while $\nu$ stays a free parameter (this comes from [[https://en.wikipedia.org/wiki/Normal-inverse-gamma_distribution|the equations for the mode of the NIG distribution]]). So, using this parametrization we would specify e.g. $\mu_{my}, \sigma_{my}^2, \alpha$ and $\nu$ instead of the original four parameters.

The prior pdf is implemented in **''norm_inv_gamma_pdf''**. Go and check that the implementation fits the formulas listed above.

**Tasks:**
  - Plot the distribution using the provided **''plot_prior''** function together with its mode. Save the figure as **''map_prior_normal.png''**. \\ \\
  - Experiment with different settings of $\mu_0, \nu, \alpha, \beta$ to get a feeling for how each of the parameters influences the prior. You will need this understanding later when we build a classifier using the MAP estimate. \\ \\
  - Show that the MAP estimate of the parameters $\mu$ and $\sigma^2$ with the normal inverse gamma prior is $$\mu_* = \frac{\nu\mu_0 + \sum_{i=1}^N x_i}{N + \nu}$$ $$\sigma_*^2 = \frac{2\beta + \nu(\mu_0 - \mu_*)^2 + \sum_{i=1}^N (x_i - \mu_*)^2}{N + 3 + 2\alpha}$$ Submit a photo of your handwritten derivation notes as **''map_normal_derivation.jpg''**. \\ \\
  - Implement the **''map_estim_normal''** function which returns the MAP estimate of the parameters given the data and the prior probability parameters. <code python>
mu_map, var_map = map_estim_normal(np.array([2, 7, 3, 5, 1, 0]), mu0=0, nu=5, alpha=1, beta=2.5)
# -> (1.6363636363636365, 5.77685950413223)
</code> \\
  - Plot the likelihood, the prior and the MAP objective $p(X| \theta) p(\theta)$ next to each other, together with the MLE and MAP estimates and the prior maximum. Save the figure as **''mle_map_prior_comparison_normal.png''**. \\ **Notice:** The MAP objective is **not** a distribution as it does not integrate to one, because we omitted the evidence $p(X)$!{{ :courses:be5b33rpz:labs:04_mle_map_bayes:map_mle_prior_comparison_normal.png?nolink |}} \\
  - Examine the behavior of the MAP estimate compared with the ML estimate and the prior. Experiment with:
    * repeatedly sampling the dataset and observing the variations in the estimate
    * changing the prior parameters to see their effect on the estimate \\ \\
  - To explore the effect of the prior, plot the MLE and MAP estimates for different dataset sizes (e.g. 1, 5, 50). Plot the MAP objective with the estimates and the prior indicated. Plot also the estimated distributions for each dataset size. You should see how the prior becomes more dominant the fewer data samples are available. Notice that for a single data sample $X = \{x_1\}$ we do not plot the MLE solution.
It would actually be $\sigma^2=0$ and $\mu=x_1$, so we would get an infinitely thin and infinitely tall normal distribution centered on the data point. This is, of course, not a realistic estimate. Save the figure as **''mle_map_normal_dataset_sizes.png''**.{{ :courses:be5b33rpz:labs:04_mle_map_bayes:mle_map_normal_dataset_sizes.png?nolink |}}

==== Bayesian Inference ====

In Bayesian inference, we are interested in computing the a posteriori distribution $p(\mu, \sigma^2 | X)$ over the possible parameter values. We use Bayes' rule: \begin{align*} p(\mu, \sigma^2 | X, \mu_0, \nu, \alpha, \beta) &= \frac{\prod_{i=1}^N p(x_i|\mu, \sigma^2) p(\mu, \sigma^2)}{p(X)} \\ &=\frac{\prod_{i=1}^N \mathcal{N}(x_i|\mu, \sigma^2) \mbox{NIG}(\mu, \sigma^2 | \mu_0, \nu, \alpha, \beta)}{p(X)} \\ &= \frac{\kappa \; \mbox{NIG}(\mu, \sigma^2 | \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta})}{p(X)} \end{align*} where the conjugate relationship between the likelihood and the prior has been used and $\kappa$ is the associated normalization constant. The parameters of the new normal inverse gamma distribution [[https://en.wikipedia.org/wiki/Conjugate_prior|can be shown]] to be \begin{align*} \tilde{\alpha} = \alpha + N/2 \qquad \tilde{\nu} = \nu + N \qquad \tilde{\mu}_0 = \frac{\nu \mu_0 + \sum_i x_i}{\nu + N} \\ \tilde{\beta} = \beta + \frac{\sum_i x_i^2}{2} + \frac{\nu \mu_0^2}{2} - \frac{(\nu \mu_0 + \sum_i x_i)^2}{2(\nu + N)} \end{align*}

Notice further that the a posteriori is a distribution, i.e. it integrates to one, so the right-hand side of the equation must also be a distribution, and thus the constant $\kappa$ and the denominator must exactly cancel out. So, thanks to the conjugacy of the prior, we again arrive at a closed-form solution for the a posteriori distribution: $$ p(\mu, \sigma^2 | X, \mu_0, \nu, \alpha, \beta) = \mbox{NIG}(\mu, \sigma^2 | \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta}) $$

**Task:**
  * Implement the above formulas into **''bayes_posterior_params_normal''** and use the function **''plot_posterior_normal''** to plot the a posteriori probability over $\mu$ and $\sigma^2$ values. Mark the MLE, MAP and maximum prior solutions.{{ :courses:be5b33rpz:labs:04_mle_map_bayes:bayes_normal_posterior.png?nolink |}} Save the figure as **''bayes_posterior_normal.png''**.

Notice that the graph looks exactly the same as for the MAP objective we created above. This is due to the normalization of the plots in matplotlib. However, as noted above, the MAP objective is not a probability distribution as it does not integrate to one. On the other hand, the a posteriori $p(\mu, \sigma^2 | X, \mu_0, \nu, \alpha, \beta)$ is a proper distribution which integrates to one. One interpretation of this graph is that the most likely solution is indeed the MAP estimate, but there are also other possible solutions, which only have a smaller probability. What we want to do is take them into account when making the prediction of the estimated distribution.

**Predictive density**

In Bayesian inference we do not take only the most probable estimate of $\mu$ and $\sigma^2$ and use it to evaluate $p(x|\mu, \sigma^2)$; instead, we compute a weighted average of the predictions at a test point $x^*$ over all possible parameter settings, where the weighting is given by the above posterior distribution.
\begin{align*} p(x^* | X, \mu_0, \nu, \alpha, \beta) &= \int \int p(x^* | \mu, \sigma^2) p(\mu, \sigma^2 | X, \mu_0, \nu, \alpha, \beta) \,d\mu \,d\sigma^2 \\ &= \int \int \mathcal{N}(x^*|\mu, \sigma^2) \mbox{NIG}(\mu, \sigma^2 | \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta}) \,d\mu \,d\sigma^2 \\ &= \int \int \kappa(x^*, \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta}) \mbox{NIG}(\mu, \sigma^2 | \breve{\mu}_0, \breve{\nu}, \breve{\alpha}, \breve{\beta}) \,d\mu \,d\sigma^2 \end{align*} where we used the conjugate relation for the second time (adapted from the [[http://www.computervisionmodels.com/|Prince book]]). We will call this distribution the //predictive density//. Taking the constant out of the integrals we get \begin{align*} p(x^* | X, \mu_0, \nu, \alpha, \beta) &= \kappa(x^*, \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta}) \int \int \mbox{NIG}(\mu, \sigma^2 | \breve{\mu}_0, \breve{\nu}, \breve{\alpha}, \breve{\beta}) \,d\mu \,d\sigma^2 \\ &= \kappa(x^*, \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta}) \end{align*} as the integral over the distribution is one, and the constant $\kappa$ is given by $$ \kappa(x^*, \tilde{\mu}_0, \tilde{\nu}, \tilde{\alpha}, \tilde{\beta}) = \frac{1}{\sqrt{2\pi}} \frac{\sqrt{\tilde{\nu}}\tilde{\beta}^{\tilde{\alpha}}}{\sqrt{\breve{\nu}}\breve{\beta}^{\breve{\alpha}}} \frac{\Gamma(\breve{\alpha})}{\Gamma(\tilde{\alpha})} $$ and \begin{align*} \breve{\alpha} = \tilde{\alpha} + 1/2 \qquad \breve{\nu} = \tilde{\nu} + 1 \\ \breve{\beta} = \frac{{x^*}^2}{2} + \tilde{\beta} + \frac{\tilde{\nu}\tilde{\mu}_0^2}{2} - \frac{(\tilde{\nu}\tilde{\mu}_0 + x^*)^2}{2(\tilde{\nu} + 1)} \end{align*}

**Tasks:**
  - Implement the function **''bayes_estim_pdf_normal''** using the above equations. \\ \\
  - Plot the estimated pdfs for MLE, MAP and Bayesian inference in one graph. Beware that the Bayes estimate is **not** a Normal distribution anymore! Make plots for different dataset sizes. Save the figure as **''mle_map_bayes_normal.png''**.{{ :courses:be5b33rpz:labs:04_mle_map_bayes:mle_map_bayes_normal.png?nolink |}} \\
  - Now imagine that, for some reason, part of your data is corrupted by errors/noise. Add a single out-of-distribution sample further from the distribution mean. Plot the estimated distributions using this corrupted dataset. Save the figure as **''noise.png''**.{{ :courses:be5b33rpz:labs:04_mle_map_bayes:noise.png?nolink |}}

==== Take-away summary ====
  * MLE is easy to compute, but it requires large datasets to produce a reliable estimate. With small datasets, it often produces wrong but over-confident (high pdf values) predictions.
  * MLE is sensitive to outliers.
  * MAP allows us to include a prior on the estimated parameters and can thus produce reasonable predictions even on smaller datasets.
  * MAP is also sensitive to outliers, as we rely on a single estimate and neglect the rest of the distribution over the parameters.
  * In the limit, as the dataset size increases, the MLE and MAP estimates converge to each other.
  * When given a uniform prior, MAP becomes MLE (shown in the lecture slides). Bayesian inference can work even with a uniform prior; however, having a good prior improves the estimate.
  * Bayesian estimates tend to be rather conservative (notice the heavier tails of the distribution), giving some probability even to data values not observed yet.
  * Ideally, we would like to use Bayesian inference, but its need for a conjugate prior is limiting and thus we often resort to a less robust method. This then increases our need for large datasets.
  * Alternatively, we may resort to a numerical solution of the Bayesian inference when we do not have a conjugate prior.
  * Another question is what a good prior for a particular problem is... We will see later how a wrong prior influences a Bayesian //classifier// which uses a MAP estimate.

===== Part 2: Categorical distribution =====

Next, let's look at estimating the parameters of a [[https://en.wikipedia.org/wiki/Categorical_distribution|categorical distribution]]. Suppose that our dataset $X=\{x_1, \ldots, x_N\}$ consists of points $x_i \in \{1, 2, \ldots, K\}$, as if we were throwing a biased die or estimating the preferences for $K$ presidential candidates. The categorical distribution is defined as $$ p(x = k| p_1, \ldots, p_K) = p_k \qquad \mbox{where} \quad \sum_{k=1}^K p_k = 1$$ For ML and MAP we estimate the parameters $\{p_k\}_{k=1}^K$; for the Bayesian approach we compute a probability distribution over the parameters.

**Tasks:**
  - Start by generating data (e.g. 50 points) from a known categorical distribution with $K=6$ (e.g. a biased die). Make one of the class probabilities intentionally very small (e.g. 0.01) to be able to observe the effect of zero training samples for one bin. \\ \\
  - Plot side-by-side the distribution itself and the histogram of the measured data (by the way, the categorical distribution is sometimes referred to as a 'normalized histogram'). Save the figure as **''categorical_data.png''**.{{ :courses:be5b33rpz:labs:04_mle_map_bayes:categorical_data.png?nolink |}}

==== MLE ====

**Tasks:**
  - Try showing that the maximum likelihood estimate of the parameters $p_k$ is $$ \hat{p}_k = \frac{N_k}{\sum_{m=1}^K N_m} $$ where $N_k$ is the number of measurements of the class $k$ (on a piece of paper, won't be checked). \\ **Hint:** Use the [[https://en.wikipedia.org/wiki/Lagrange_multiplier|Lagrange multiplier method]] (or [[https://www.khanacademy.org/math/multivariable-calculus/applications-of-multivariable-derivatives/constrained-optimization/a/lagrange-multipliers-examples|here]]) for the constraint $\sum_k p_k = 1$ and combine the derivatives over $p_k$ and the Lagrange multiplier $\lambda$. It is also easier to optimize the log of the likelihood. \\ \\
  - Complete the function **''ml_estim_categorical''** and use it to estimate the distribution parameters. \\ \\
  - Plot the estimated distribution. Save the figure as **''mle_categorical.png''**.

==== MAP ====

As the prior needed for MAP estimation we choose the [[https://en.wikipedia.org/wiki/Dirichlet_distribution|Dirichlet distribution]], as it is conjugate to the categorical likelihood $$ \mbox{Dir}(p_1, \ldots, p_K; \alpha_1, \ldots, \alpha_K) = \frac{1}{B(\alpha)} \prod_{k=1}^K p_k^{\alpha_k - 1} $$ where $B(\alpha)$ is a normalization constant and $\alpha = (\alpha_1, \ldots, \alpha_K)$. Samples from it can be drawn with **''numpy.random.dirichlet''**.

**Tasks:**
  - As with the previous prior, we start by examining its behavior for various parameter settings. Sample five random distributions from this prior and plot them side-by-side (a minimal sampling sketch follows below).
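A minimal sketch of drawing the five samples (the chosen $\alpha$ values and variable names are only an example):
<code python>
import numpy as np

K = 6
alpha = np.ones(K)                             # e.g. alpha_k = 1 for all k
samples = np.random.dirichlet(alpha, size=5)   # shape (5, 6); each row sums to 1
</code>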
Experiment with the parameters $\alpha$:
    * start with $\alpha_k = 1$ for all $k$
    * keep all the values the same, but increase their value
    * try to set one of the $\alpha_k$'s significantly higher
    * try to set all $\alpha_k$ below 1 (this will not work with the MAP estimate - you may get negative probability estimates){{ :courses:be5b33rpz:labs:04_mle_map_bayes:dirichlet_sample.png?nolink |}} \\ \\
Thus, a fair die would correspond to $\alpha_k \rightarrow \infty$, and when all $\alpha_k$'s are the same and greater than one, the distribution produces a fair die on average. When the prior is uniform, i.e. when $\alpha_k = 1$ for all $k$, the MAP derivation reduces to the ML estimate. \\ \\ Some more intuition for the Dirichlet distribution as a conjugate prior is given in this [[https://towardsdatascience.com/dirichlet-distribution-a82ab942a879|blog post]]. \\ \\
  - Show that the MAP estimate of the parameters $p_k$ is $$ \hat{p}_k = \frac{N_k + \alpha_k - 1}{\sum_{m=1}^K (N_m + \alpha_m - 1)} $$ where $N_k$ is the number of measurements of the class $k$. \\ **Hint:** The derivation is similar to the MLE estimate. \\ \\
  - Submit a photo of your handwritten derivation notes as **''map_categorical_derivation.jpg''**. \\ \\
  - Complete the function **''map_estim_categorical''** and use it to estimate the distribution parameters. \\ \\
  - Plot the estimated distribution. Save the figure as **''map_categorical.png''**.{{ :courses:be5b33rpz:labs:04_mle_map_bayes:map_categorical.png?nolink |}}

==== Bayesian inference ====

In the Bayesian approach we calculate the a posteriori probability over the parameters (derivations adapted from the [[http://www.computervisionmodels.com|Prince book]]) \begin{align*} p(p_1, \ldots, p_6| X) &= \frac{\prod_{i=1}^N p(x_i|p_1, \ldots, p_6) p(p_1, \ldots, p_6)}{p(X)} \\ &= \frac{\prod_{i=1}^N \mbox{Cat}(x_i|p_1, \ldots, p_6) \mbox{Dir}(p_1, \ldots, p_6; \alpha_1, \ldots, \alpha_6)}{p(X)} \\ &= \frac{\kappa(\alpha_{1},\ldots, \alpha_{6}, X) \mbox{Dir}(p_1, \ldots, p_6; \tilde{\alpha}_1, \ldots, \tilde{\alpha}_6)}{p(X)} \\ &= \mbox{Dir}(p_1, \ldots, p_6; \tilde{\alpha}_1, \ldots, \tilde{\alpha}_6) \end{align*} where $\tilde{\alpha}_k = N_k + \alpha_k$ and where we have once again exploited the conjugate relationship between the likelihood and the prior. As before, the normalization constant $\kappa$ cancels out with the denominator, so that we obtain a valid probability distribution.

The predictive density used in ML and MAP was simply the categorical distribution with the estimated parameters $\hat{p}_k$. In the Bayesian case, we compute a weighted average of the predictions for each possible parameter set, where the weighting is given by the posterior distribution. \begin{align*} p(x^*|X) &= \int p(x^*|p_1, \ldots, p_6) p(p_1, \ldots, p_6| X) dp_{1\ldots 6} \\ &= \int \mbox{Cat}(x^*|p_{1\ldots 6}) \mbox{Dir}(p_{1\ldots 6}; \tilde{\alpha}_{1\ldots 6}) dp_{1\ldots 6} \\ &= \int \kappa(x^*, \tilde{\alpha}_{1\ldots 6}) \mbox{Dir}(p_{1\ldots 6}; \breve{\alpha}_{1\ldots 6}) dp_{1\ldots 6} \\ &= \kappa(x^*, \tilde{\alpha}_{1\ldots 6}) \end{align*} and it can be shown that $$ p(x^*=k|X) = \kappa(x^*, \tilde{\alpha}_{1\ldots 6}) = \frac{N_k + \alpha_k}{\sum_{j=1}^6 (N_j + \alpha_j)} $$

**Tasks:**
  - Implement the **''bayes_posterior_params_categorical''** function. \\ \\
  - Draw five samples from this posterior and display them side-by-side. Compared with the samples from the prior, they should resemble the data distribution better.
Save the figure as **''bayes_posterior_categorical.png''**. {{ :courses:be5b33rpz:labs:04_mle_map_bayes:bayes_categorical_posterior_samples.png?nolink |}} \\
  - Implement the function **''bayes_estim_categorical''** which returns the Bayesian estimate. \\ \\
  - Plot the estimated distribution. Save the figure as **''bayes_categorical.png''**. {{ :courses:be5b33rpz:labs:04_mle_map_bayes:bayes_categorical_estim.png?nolink |}} \\
  - Experiment with different priors.

==== Take-away summary ====
  * Notice that even with a flat, uninformative prior (i.e. when $\alpha_1 = \ldots = \alpha_6 = 1$), the Bayesian approach tends to give non-zero probability to bins for which no data has been observed yet.
  * In contrast, with the MAP estimate a flat prior gives the same result as the ML estimate.

===== Part 3: Building a classifier =====

In this part we will use the MLE, MAP and Bayesian estimates of the Normal distribution to build a Bayes classifier as in the previous assignment. Feel free to use your code from the previous lab to finish this task. However, upload every function needed so that we can evaluate it automatically.

The task is again to distinguish the images of the letter 'A' from the images of 'C' using the **''compute_measurement_lr_cont''** measurement. Spend some time specifying a meaningful prior for the MAP and Bayesian estimates **for both classes**. The old values (0, 1) will no longer work here, as the mean and variance of the data are different (in the thousands)! It is recommended to plot the prior to get visual feedback on the parameter settings.

**Tasks:**
  - Implement the functions **''mle_Bayes_classif''**, **''map_Bayes_classif''** and **''bayes_Bayes_classif''**. Assume a 0/1 loss function. The first two will use the strategy from the Bayes classification labs, whereas for the Bayesian estimate the pdf is no longer a Normal distribution, so we have to use the inequality $p(A)p(x|A) > p(C)p(x|C)$ to decide on the class. Compute the a priori probabilities as relative frequencies in the //training// dataset. \\ \\
  - Compute the classification error on the test set for all three methods. \\ \\
  - Visualize the data through normalized histograms, and plot the estimated probability densities and the estimated strategies all in one plot. Save the figure as **''mle_map_bayes_Bayes_classifier.png''**. {{ :courses:be5b33rpz:labs:04_mle_map_bayes:mle_map_bayes_bayes_classifier.png?nolink |}} \\
  - Experiment with setting the MAP prior probability parameters in various ways to get a feeling for how they influence the final classification. \\ \\
  - Experiment with different training dataset sizes (20, 200, 2000 samples). Observe the influence of the prior parameters on the classification given different dataset sizes.

===== References =====

Simon J.D. Prince. [[http://www.computervisionmodels.com/|Computer Vision: Models, Learning, and Inference]]. Cambridge University Press, 2012.