====== Bayesian Decision Making ======

{{ .:letter_matrix.png?direct&300 |}}

In this lab we will implement an [[wp>Optical_character_recognition|OCR system]]. For simplicity we assume the texts contain only two letters, A and C, and that the letters are already well segmented into 10x10 pixel image chips. We will extract a simple feature from each image chip and use it in a Bayesian decision framework to partition the image set. Furthermore, the relative frequency of occurrence of A and C in text is given.

We will consider two cases: (1) the measurements are discrete, and (2) the measurements are continuous.

In the **discrete probabilities** case, we will find the optimal strategy, compute its Bayesian risk, experiment with the loss function and classify independent data samples by applying the theory from the lecture slides directly.

In the **continuous measurements** case, we assume that the extracted feature for each letter class (A, C) is generated by a normal distribution. The 'zero-one' loss will be used to estimate the risk, which simplifies the decision strategy.

[[https://cw.fel.cvut.cz/wiki/courses/be5b33rpz/labs/python_development|General information for Python development]].

To fulfil this assignment, you need to submit these files (all packed in one ''.zip'' file) into the [[https://cw.felk.cvut.cz/brute/|upload system]]:
  * **''answers.txt''** - answers to the Assignment Questions
  * **''bayes.ipynb''** - a notebook for data initialisation, calling the implemented functions and plotting their results (for your convenience, will not be checked)
  * **''bayes.py''** - file with the following methods implemented:
    * **''find_strategy_discrete''** - function which finds the optimal discrete strategy (for your convenience, will not be checked)
    * **''bayes_risk_discrete''** - function which computes the Bayesian risk for a given discrete strategy
    * **''classify_discrete''** - function which classifies a given data sample using a discrete strategy
    * **''classification_error_discrete''** - function which computes the classification error on test data using a discrete strategy
    * **''find_strategy_2normal''** - function which finds the optimal continuous strategy
    * **''bayes_risk_2normal''** - function which computes the Bayesian risk for a given continuous strategy
    * **''classify_2normal''** - function which classifies a given data sample using a continuous strategy
    * **''classification_error_2normal''** - function which computes the classification error on test data using a continuous strategy
  * **''classif_W1.png''**, **''classif_W2.png''** and **''decision_discrete.png''** - images specified in [[.:start#Discrete measurements|Discrete measurements]]
  * **''thresholds.png''**, **''decision_2normal.png''** - images specified in [[.:start#Continuous measurements|Continuous measurements]]

**Use the [[https://cw.fel.cvut.cz/wiki/courses/be5b33rpz/labs/python_development#Assignment Templates|assignment template]].**

When preparing the zip file for the upload system, **do not include any directories** - the files have to be in the zip file root.
===== Discrete measurements =====

The task is to design a classifier (Bayesian strategy) $q(x)$, which distinguishes between 10x10 images of two letters (classes) A and C using only a single measurement (feature):

$$ q: \mathcal{X} \rightarrow D $$
$$ \mathcal{X} = \{-10, \ldots, 10\}, \quad D = K = \{A, C\}$$

The measurement $x$ is obtained from an image as follows:

<code>
mu = -563.9
sigma = 2001.6
x_unnormalized = (sum of pixel values in the left half of the image)
               - (sum of pixel values in the right half of the image)
x = (x_unnormalized - mu) / (2 * sigma) * 10
limit x to the interval <-10, 10>
</code>

Note: the normalisation squeezes most of the $x$ values into the interval $\langle -10, 10 \rangle$. Computation of this measurement is implemented in ''bayes.compute_measurement_lr_discrete''.

The strategy $q(x)$ will be represented in numpy by a (21, ) vector containing 0 if the classification for that value of $x$ is supposed to be A, and 1 if the classification is supposed to be C. Thus, given the vector $q$ and some $x$, we can decide for one of the classes.

As required in the formulation of the Bayesian task, we are given all the necessary probabilities (specified in the **''bayes.ipynb''**):

^ variable ^ description ^
| ''discreteA['Prob']'' | $p_{X|k}(x|A)$ given as a (21, ) numpy array (the size corresponds to the range of the values of $x$) |
| ''discreteA['Prior']'' | prior probability $p_K(A)$ |
| ''discreteC['Prob']'' | $p_{X|k}(x|C)$ given as a (21, ) numpy array (the size corresponds to the range of the values of $x$) |
| ''discreteC['Prior']'' | prior probability $p_K(C)$ |
| ''images_test'' | (10, 10, 40) array of **test** images of size 10x10 pixels, depicting letters A and C |
| ''labels_test'' | (40, ) array with the ground-truth classification (0=A, 1=C) |

==== Tasks ====

  - Complete the function template ''bayes_risk_discrete'' so that it returns the Bayesian risk for a given strategy. The input parameters are the distributions, the strategy $q$ and an arbitrary loss matrix $W$ (stored in a (2, 2) array).
<code python>
q_discrete = np.array([0]*10 + [1] + [0]*10)
W = np.array([[0, 1], [1, 0]])

discreteA = {'Prior': 0.6153846153846154,
             'Prob': np.array([0.0125, 0., 0., 0.0125, 0.025, 0.0125, 0.025, 0.0375, 0.075, 0.1, 0.2125, 0.1375, 0.15, 0.1, 0.0875, 0.0125, 0., 0., 0., 0., 0.])}
discreteC = {'Prior': 0.38461538461538464,
             'Prob': np.array([0., 0., 0., 0.02, 0.02, 0.22, 0.46, 0.16, 0.1, 0.02, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])}

R_discrete = bayes_risk_discrete(discreteA, discreteC, W, q_discrete)
# -> 0.5153846153846154
</code>
  - Complete the function template ''find_strategy_discrete'' so that it returns the optimal Bayesian strategy. //When the Bayesian risk is the same for both classes for a particular $x$, prefer the "A" class to pass the automatic checks in the upload system.// \\ **Hint:** Use the second line of the derivation on slide 11 in the {{:courses:be5b33rpz:lessons:bayes.pdf|lecture slides}}.
<code python>
W = np.array([[0, 1], [1, 0]])
q_discrete = find_strategy_discrete(discreteA, discreteC, W)
# -> array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

distribution1 = {}
distribution2 = {}
distribution1['Prior'] = 0.3
distribution2['Prior'] = 0.7
distribution1['Prob'] = np.array([0.2, 0.3, 0.4, 0.1])
distribution2['Prob'] = np.array([0.5, 0.4, 0.1, 0.0])
W = np.array([[0, 1], [1, 0]])
q = find_strategy_discrete(distribution1, distribution2, W)
# -> array([1, 1, 0, 0])
</code>
  - Compute the risk of the optimal strategy.
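A minimal sketch of how the risk of the optimal strategy could be obtained by chaining the two functions above (assuming ''find_strategy_discrete'' and ''bayes_risk_discrete'' are already implemented in ''bayes.py'' and that ''discreteA''/''discreteC'' are the dictionaries defined above; the resulting number is intentionally not listed here):

<code python>
import numpy as np
from bayes import find_strategy_discrete, bayes_risk_discrete

# zero-one loss, as in the examples above
W = np.array([[0, 1], [1, 0]])

# find the optimal strategy for W and evaluate its Bayesian risk
q_opt = find_strategy_discrete(discreteA, discreteC, W)
R_opt = bayes_risk_discrete(discreteA, discreteC, W, q_opt)
print(R_opt)
</code>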
  - Plot the classification using the ''visualise_discrete'' function for two loss functions, ''W1 = np.array([[0, 1], [1, 0]])'' and ''W2 = np.array([[0, 5], [1, 0]])'', and save the figures as **''classif_W1.png''** and **''classif_W2.png''**.
  - Complete the ''classify_discrete'' function template which realises the found strategy, so that it returns N labels given a (10, 10, N) array of images.
  - Complete the ''classification_error_discrete'' function so that it returns the error on the whole image set ''images_test''. The error is defined as the number of incorrectly classified samples divided by the total number of samples. To measure the error, compare your classification with the [[wp>ground truth]] in the ''labels_test'' variable.
<code python>
W = np.array([[0, 1], [1, 0]])
q_discrete = find_strategy_discrete(discreteA, discreteC, W)
error_discrete = classification_error_discrete(images_test, labels_test, q_discrete)
# -> 0.225
</code>
  - Show the images classified as A and the images classified as C in a figure. Save it as **''decision_discrete.png''**.

===== Assignment Questions, part 1 =====

Fill the correct answers into your ''answers.txt'' file.

  - What is the sum of the ''discreteA['Prob']'' array? (you should also be able to argue why :)
  - What is the risk of the optimal strategy for ''W = np.array([[0, 1], [2, 0]])''? (can you think of a situation where you would want the loss to be unbalanced like this?) **Round to 4 decimal places** (use the parameters of the distributions defined in task 1).
  - What is the relation between the final classification error (computed by the ''classification_error_discrete'' function) and the Bayesian risk (computed by the ''bayes_risk_discrete'' function)?
    * a) The classification error is always higher than the Bayesian risk
    * b) The classification error is always smaller than the Bayesian risk
    * c) They are two independent quantities and minimising one does not necessarily minimise the other

Example ''answers.txt'' (bogus answers only, fill in the correct ones):

<code>
question1 : gen_zero_matrix
question2 : 12.5467
question3 : [b,d,f]
</code>

===== Continuous measurements =====

In the second part of the assignment the probabilities $p_{X|k}(x|k)$ are given as continuous [[http://en.wikipedia.org/wiki/Normal_distribution|normal distributions]] (specified in the **''bayes.ipynb''**):

$$p_{X|k}(x|k) = \frac{1}{\sigma_k\sqrt{2\pi}}e^{-\frac{(x-\mu_k)^2}{2\sigma_k^2}}, \quad k \in \{A, C\}$$

^ variable ^ description ^
| ''contA['Mean']'' | mean value of the normal distribution $p_{X|k}(x|A)$ |
| ''contA['Sigma']'' | standard deviation of the normal distribution $p_{X|k}(x|A)$ |
| ''contA['Prior']'' | prior probability $p_K(A)$ |
| ''contC['Mean']'' | mean value of the normal distribution $p_{X|k}(x|C)$ |
| ''contC['Sigma']'' | standard deviation of the normal distribution $p_{X|k}(x|C)$ |
| ''contC['Prior']'' | prior probability $p_K(C)$ |
| ''images_test'' | (10, 10, 40) array of **test** images of size 10x10 pixels, depicting letters A and C |
| ''labels_test'' | (40, ) array with the ground-truth classification (0=A, 1=C) |

Example distributions:

<code python>
contA = {'Mean': 124.2625, 'Sigma': 1434.45420083, 'Prior': 0.61538462}
contC = {'Mean': -2010.98, 'Sigma': 558.42857106, 'Prior': 0.38461538}
</code>

Further, we assume the zero-one loss matrix ''W = np.array([[0, 1], [1, 0]])''.
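Under this zero-one loss, the optimal decision (derived formally in the next paragraph) amounts to comparing the class-weighted densities $p_K(k)p_{X|k}(x|k)$. A minimal sketch of evaluating them with ''scipy.stats.norm'', using the example ''contA''/''contC'' values above (the snippet is only illustrative and not part of the required template):

<code python>
from scipy.stats import norm

contA = {'Mean': 124.2625, 'Sigma': 1434.45420083, 'Prior': 0.61538462}
contC = {'Mean': -2010.98, 'Sigma': 558.42857106, 'Prior': 0.38461538}

x = -2500.0  # an example (unnormalised) left-right measurement

# class-weighted densities p_K(k) * p_{X|k}(x|k)
joint_A = contA['Prior'] * norm.pdf(x, loc=contA['Mean'], scale=contA['Sigma'])
joint_C = contC['Prior'] * norm.pdf(x, loc=contC['Mean'], scale=contC['Sigma'])

# with the zero-one loss, decide for the class with the larger weighted density
decision = 'A' if joint_A >= joint_C else 'C'
print(decision)
</code>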
In this particular case, as explained in the {{:courses:be5b33rpz:lessons:bayes.pdf|lecture notes, slides 21-22}}, the optimal strategy $q$ can be found by solving the following problem:

$$q(x) = \arg\max_k p_{K|X}(k|x) = \arg\max_k p_{XK}(x,k)/p_X(x) = \arg\max_k p_{XK}(x,k) = \arg\max_k p_K(k)p_{X|k}(x|k),$$

where $p_{K|X}(k|x)$ is called the //posterior// probability of class $k$ given the measurement $x$ (i.e. the probability of the data being from class $k$ when we measure feature value $x$). The symbol $\arg\max_k$ denotes finding the $k$ maximising the argument.

We will also work with an unnormalised measurement here, instead of the discrete one used above:

<code>
x = (sum of pixel values in the left half of the image)
  - (sum of pixel values in the right half of the image)
</code>

Computation of this measurement is implemented in ''compute_measurement_lr_cont''.

==== Quadratic discriminative function ====

At the beginning of the lab we will show together that when the $p_{X|k}(x|k)$ are normal distributions and the zero-one loss function is used, the optimal Bayesian strategy $q(x)$ corresponds to a quadratic inequality (i.e. $ax^2+bx+c \ge 0$).

**Hint:**
$$ \arg\max_{k\in\{A,C\}} p_K(k)p_{X|k}(x|k) \quad \mbox{translates into} \quad p_K(A)p_{X|k}(x|A) \ge p_K(C)p_{X|k}(x|C) $$

**Optimal strategy representation:** Solving the above quadratic inequality gives up to two thresholds $t_1$ and $t_2$ (the zero-crossing points) and a classification decision (A or C) in each of the intervals $\mathcal{X}_1 = (-\infty, t_1\rangle$, $\mathcal{X}_2 = (t_1, t_2\rangle$, $\mathcal{X}_3 = (t_2, \infty)$. So, similarly to the discrete case, we can represent the strategy in Python as a dict with the following fields:

^ field ^ description ^
| ''q['t1']'' | $t_1$ threshold (possibly $-\infty$) |
| ''q['t2']'' | $t_2$ threshold (possibly $\infty$) |
| ''q['decision']'' | (3, ) np.array containing the optimal decisions for each of the three intervals |

==== Tasks ====

  - Fill in the formulas for the quadratic inequality coefficients $a$, $b$ and $c$ in ''find_strategy_2normal''.
  - Complete the ''find_strategy_2normal'' template by filling in the code determining the decision vector. **Make sure that you return a proper strategy for all corner cases** (same variances but different means; prior equal to 1 for one class; same means and one prior significantly higher; both distributions the same but possibly different priors; high prior combined with high variance; $p(x|A)p(A) = p(x|C)p(C)$ in exactly one point). **All of these conditions can be handled by inspecting the quadratic inequality only; no pdf evaluation is needed!**
<code python>
q_cont = find_strategy_2normal(contA, contC)
# -> {'t2': -1248.7684903033442, 't1': -3535.997150276402, 'decision': array([0, 1, 0], dtype=int32)}
</code>
  - Use the ''visualize_2norm'' function from the template to display the probabilities $p_{XK}(x,k)$ and the classification thresholds $t_1$ and $t_2$, and save the figure as **''thresholds.png''**.

{{ :courses:ae4b33rpz:labs:02_bayes:cont_thresholds.png?direct&300 |found thresholds}}

  - Complete the template function ''bayes_risk_2normal''.
We can reformulate the equations from the {{:courses:be5b33rpz:lessons:bayes.pdf|lecture slides}}, slide 21, as

$$ R(q) = \int_X \sum_{k\in K}p_{XK}(x,k)W(k,q(x))\,dx = \\ = \int_X p_X(x) \sum_{k\ne q(x)}p_{K|X}(k|x)\,dx = \\ = \int_X p_X(x)\left(1 - p_{K|X}(q(x)|x)\right)\,dx = \\ = 1 - \int_X p_X(x)p_{K|X}(q(x)|x)\,dx = \\ = 1 - \int_X p_K(q(x))p_{X|k}(x|q(x))\,dx\,,$$

which in our special case simplifies further to

$$R(q) = 1-\int_X p_K(q(x))p_{X|k}(x|q(x))\,dx = \\ = 1 - \left( \int_{\mathcal{X}_1} p_K(\mbox{q.decision}(1))p_{X|k}(x|\mbox{q.decision}(1))\,dx + \int_{\mathcal{X}_2} p_K(\mbox{q.decision}(2))p_{X|k}(x|\mbox{q.decision}(2))\,dx + \int_{\mathcal{X}_3} p_K(\mbox{q.decision}(3))p_{X|k}(x|\mbox{q.decision}(3))\,dx \right),$$

where $\mbox{q.decision}(i)$ denotes the decision assigned to the interval $\mathcal{X}_i$ (stored in ''q['decision'][i-1]''). Thus we only need to compute the integrals of the normal distribution corresponding to the decided class over the three intervals, multiply them by the prior probabilities and subtract their sum from one.

**Hint:** To integrate the normal densities you can use the ''scipy.stats.norm.cdf'' function.

<code python>
R_cont = bayes_risk_2normal(contA, contC, q_cont)
# -> 0.13519281686757106
</code>
  - Complete the ''classify_2normal'' function template so that it returns a label given a 10x10 image.
  - Complete the ''classification_error_2normal'' function so that it returns the error on the whole image set ''images_test''. To measure the error, compare your classification with the ground truth in the ''labels_test'' variable.
<code python>
error_cont = classification_error_2normal(images_test, labels_test, q_cont)
# -> 0.1
</code>
  - Show the images classified as A and the images classified as C using the ''montage'' function. Save the figure as **''decision_2normal.png''** (see **''bayes.ipynb''**).

{{ :courses:ae4b33rpz:labs:02_bayes:continuous_classification_result.png?direct&300 |result of the classification with two normal distributions and the zero-one loss function}}

===== Assignment Questions, part 2 =====

Fill the correct answers into your ''answers.txt'' file.

**4.** Which of the conditions allowed us to express the optimal Bayesian strategy as a quadratic discriminative function (check all that apply):
  * a) The conditional probabilities $p(x \mid k)$ follow a Gaussian distribution
  * b) The prior probabilities $p(k)$ of all classes are equal
  * c) The prior probabilities $p(k)$ of all classes are known
  * d) There were two classes only
  * e) The loss is zero for all misclassifications
  * f) The loss for a correct classification is zero and the loss for a misclassification is equal for all classes

**5.** If we extend the number of classes from 2 to 3 and keep the zero-one loss function, check all that apply in this situation:
  * a) The measurement is now 2-dimensional instead of 1-dimensional, therefore we model $p_{X|k}(x|k)$ with a 2-dimensional normal distribution
  * b) The discriminative function is now cubic in $x$ instead of quadratic
  * c) We always select the class $k$ such that the Bayesian risk is minimal
  * d) An optimal Bayesian strategy cannot be found for more than 2 classes

===== Bonus task =====

This task is not compulsory. Working on bonus tasks deepens your knowledge about the subject. A successful solution of the bonus task(s) will be rewarded by up to 4 points. The data for this task is located in ''data_33rpz_bayes_bonus.npz''.
Use the following distributions ''D1'', ''D2'' and ''D3'':

<code python>
D1 = {'Mean': np.array([151.61, 1154.01]),
      'Prior': 0.4166666666666667,
      'Cov': np.array([[2048960.78575758, 169829.10494949],
                       [169829.10494949, 482516.97969697]])}

D2 = {'Mean': np.array([-2055.76666667, 54.3]),
      'Prior': 0.25,
      'Cov': np.array([[307539.97853107, -2813.69830508],
                       [-2813.69830508, 485574.58644068]])}

D3 = {'Mean': np.array([492.7125, -2353.1125]),
      'Prior': 0.3333333333333333,
      'Cov': np.array([[968935.42262658, 12248.03053797],
                       [12248.03053797, 407017.16439873]])}
</code>

The task is analogous to the previous one, only there are now two measurements (features) and three classes. The second measurement is

<code>
y = (sum of pixel values in the upper half of the image)
  - (sum of pixel values in the lower half of the image)
</code>

The alphabet (i.e. the set of classes) now consists of three letters: 'A', 'C' and 'T'. The task is to classify the input images into one of the three classes, using both measurements $x$ and $y$. The density function $p_{X}(\mathbf{x}) = p_K(A)p_{X|k}(\mathbf{x}|A)+p_K(C)p_{X|k}(\mathbf{x}|C)+p_K(T)p_{X|k}(\mathbf{x}|T)$, where $\mathbf{x} = [x, y]$, is visualised below.

{{ bonus2_small.png |Bonus task}}

**Output:**
  * Classification error of the optimal Bayesian strategy on the provided image set,
  * Classification of the input images, i.e. which image was assigned to which class – draw the images per class.

**Hints:**
  * The [[wp>Multivariate normal distribution|multivariate Gaussian]] distribution density is given by $$f_{\mathbf x}(x_1,\ldots,x_k) = \frac{1}{\sqrt{(2\pi)^k|\boldsymbol\Sigma|}} \exp\left(-\frac{1}{2}({\mathbf x}-{\boldsymbol\mu})^T{\boldsymbol\Sigma}^{-1}({\mathbf x}-{\boldsymbol\mu}) \right),$$ where $k$ is the number of dimensions, $\boldsymbol\Sigma$ is the $k \times k$ covariance matrix and $\boldsymbol\mu$ is the $k$-dimensional mean vector.
  * Numerically calculate the Bayesian risk of every possible decision for each input image and select the optimal decision.

===== References =====

  * [1] Duda R. O., Hart P. E., Stork D. G.: Pattern Classification, Wiley, 2001
  * [2] Schlesinger M. I., Hlavac V.: Ten Lectures on Statistical and Structural Pattern Recognition, Kluwer Academic Publishers, 2002
  * [3] {{:courses:be5b33rpz:lessons:bayes.pdf|Slides from the lecture}} by Vaclav Hlavac and Jiri Matas
  * [4] {{:courses:be5b33rpz:labs:02_bayes:bayes_starecv.pdf|Old labs text}}
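Returning to the bonus task above, a minimal sketch of the per-image decision using ''scipy.stats.multivariate_normal'' (the dictionaries ''D1'', ''D2'' and ''D3'' are the ones listed in the bonus task; the function name, argument order and class ordering are illustrative, not prescribed by the assignment):

<code python>
import numpy as np
from scipy.stats import multivariate_normal

def classify_bonus(x_vec, dists, labels=('A', 'C', 'T')):
    """Return the class maximising p_K(k) * p_{X|k}(x|k) for a 2D measurement.

    Under the zero-one loss this is the decision with the minimal Bayesian risk
    for the given image. `dists` is a sequence of dicts with 'Mean', 'Cov' and
    'Prior' keys, e.g. [D1, D2, D3] from the bonus task.
    """
    joints = [d['Prior'] * multivariate_normal.pdf(x_vec, mean=d['Mean'], cov=d['Cov'])
              for d in dists]
    return labels[int(np.argmax(joints))]

# usage sketch: x and y are the left-right and up-down measurements of one image
# label = classify_bonus(np.array([x, y]), [D1, D2, D3])
</code>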