Warning

This page is located in archive.
Go to the latest version of this course pages.

As in the previous lab, we are again interested in probability density $p_{X|k}(x|k)$ estimation. Sometimes, however, one is challenged with the problem of **estimating the probability density function without even knowing its parametric form** (cf MLE labs). In such cases, **non-parametric estimation** using Parzen window method [1] can be applied.

The kernel density estimation $\hat{f}_h(x)$ of an unknown distribution $f(x)$ given a set $\{x_1, \ldots, x_n\}$ of independent identically distributed (i.i.d) measurements is defined as $$ \hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^n K_h(x - x_i) $$ where $K_h$ is the kernel (a symmetric function that integrates to one) and $h$ is the kernel size (bandwidth). Different kernel functions can be used (see [1]), but we will use $K_h(x) \equiv N(0, h^2)$, i.e. the normal distribution with mean 0 and standard deviation $h$.

Please, note that the kernel parameter $h$ does not make the method “parametric”. The parameter defines smoothness of the estimate, not the form of the estimated distribution. Nevertheless, the choice of $h$ still influences the quality of the estimate as you will see below. Using cross-validation [4] we will find such value of $h$ that ensures the estimate generalises well to previously unseen data.

Finally, using the above estimates of the probability density functions $p_{X|k}(x|k)$, we will once again build a Bayesian classifier for two letters and test it on an independent test set.

General information for Python development.

To fulfill this assignment, you need to submit these files (all packed in one `.zip`

file) into the upload system:

- a script for data initialization, calling of the implemented functions and plotting of their results (for your convenience, will not be checked)`parzen.ipynb`

- containing the following implemented methods:`parzen.py`

- Parzen window estimator`my_parzen`

- function for computing cross-validated log-likelihood ratio as a function of $h$`compute_Lh`

- classification using Parzen window density estimates`classify_bayes_parzen`

- plots of estimates of $p_{X|k}(x|A)$ for varying $h$`parzen_estimates.png`

,`optimal_h_classA.png`

- graph with optimal bandwidth $h$ for class A and C respectively`optimal_h_classC.png`

- image with parzen classification`parzen_classif.png`

** Use template of the assignment.** When preparing a zip file for the upload system, **do not include any directories**, the files have to be in the zip file root.

In this part we will implement the Parzen window density estimation method (see the formula above). For measurements, use the function ** compute_measurement_lr_cont**. The data loaded in the

`parzen.ipynb`

template contain training data `trn`

and test data `tst`

- images and their correct labels. For both, the label 0 denotes an image of letter 'A' and label 1 an image of letter 'C'.
Do the following:

- For each image from the training set
`trn`

compute the measurement $x$. Store the measurements for different classes separately, i.e. create two vectors`xA`

and`xC`

.

- Complete the template of the function

so that for a given $x$ (possibly a vector of values) it computes the Parzen estimation of probability density $p_{X|k}(x|k)$. Here,**p = my_parzen(x, trn_x, h)**`trn_x`

is a vector of training measurements for one of the classes (`xA`

or`xC`

) and $h$ is the kernel bandwidth. As the kernel function $K_h(x)$ use the normal distribution $N(0, h^2)$.

**Hint:**Use`scipy.stats.norm.pdf()`

to calculate $K_h$.

p = my_parzen(np.array([1, 2, 3]), np.array([-1, 0, 2, 1.5, 0]), 1.0) print(p) # -> [0.22639369 0.17268428 0.07609717]

- Use
`my_parzen`

function to estimate the probability distribution $p_{X|k}(x|A)$. Show the normalised histogram of`xA`

and the estimated $p_{X|k}(x|A)$ for $h \in \{100, 500, 1000, 2000\}$ in one figure (use`subfigure`

function) and save it as.`parzen_estimates.png`

- To compute the estimate of $p_{X|k}(x|A)$ use your function: $p_{X|k}(x|A)$ =
`my_parzen(x, xA, h)`

- Iterate the variable $x$ e.g. from
`min(xA)`

to`max(xA)`

, with a step of 100.

As we have just seen, different values of $h$ produce different estimates of $p_{X|k}(x|A)$. Obviously, for $h=100$ the estimate is influenced too much by our limited data sample and it will not generalise well to unseen examples. For $h=2000$, the estimate is clearly oversmoothed. So, what is the best value of $h$?

To assess the ability of the estimate to generalise well to unseen examples we will use 10-fold cross-validation [4]. We will split the training set ten times (10-fold) into two sets $X^i_{\mbox{trn}}$ and $X^i_{\mbox{tst}}$, $i=1,\ldots,10$ and use the set $X^i_{\mbox{trn}}$ for distribution estimation and the set $X^i_{\mbox{tst}}$ for validation of the estimate. We will use the log-likelihood of the validation set given the bandwidth $h$, $L^i(h) = \sum_{x\in X^i_{\mbox{tst}}} log(p_{X|k}(x|A))$, averaged over ten splits $L(h)=1/10 \sum_{i=1}^{10} L^i(h)$ as a measure of the estimate quality. The value of $h$ which maximises $L(h)$ is the one which produces an estimate which generalises the best. See [4] for further description of the cross-validation technique.

Do the following:

- Estimate the optimal value of the bandwidth $h$ using the 10-fold cross-validation. The procedure (shown for the class A, the class C is analogous) is as follows:
- Split the training set
`xA`

into 10 subsets using the`crossval`

function.

- Complete the template function
.`Lh = compute_Lh(itrn, itst, x, h)`

The function needs to compute the log-likelihood $L^i(h)$ on the each pair of training and test sets $X^i_{\mbox{trn}}$ (`itrn{i}`

) and $X^i_{\mbox{tst}}$ (`itst{i}`

) using the training part for estimation of $p_{X|k}(x|A)$ and the test part for computing the log-likelihood. The output is the mean log-likelihood $L(h)$ over all 10 folds.

**Hint 1:**$p_{X|k}(x|A) = \mbox{my_parzen}(x, X^i_{\mbox{trn}}, h)$.

**Hint 2**:`itrn{i}`

and`itst{i}`

contain only data indexes, not data themselves!

- Plot the log-likelihood average $L(h)$ for $h = 100, 150, \ldots, 1000$. Find the optimal value of $h$ using the
`scipy.optimize.fminbound`

function and show the density estimate of $p_{X|k}(x|A)$ with the optimal $h$. Save the graphs into`optimal_h_classA.png`

- Repeat the previous steps to estimate the distribution of $p_{X|k}(x|C)$ (save the figure into
.`optimal_h_classC.png`

- Use the estimated distributions $p_{X|k}(x|A)$ and $p_{X|k}(x|C)$ to create a Bayesian classifier with zero-one loss function:

- Estimate the a priori probabilities $p_K(A)$ and $p_K(C)$ on the training set, as you did last week.

- Complete the template of the function
which takes the measurements`labels = classify_bayes_parzen(x_test, xA, xC, pA, pC, h_bestA, h_bestC)`

`x_test`

computed on set`tst`

and finds a label for each test example.

**Hint:**Notice, that`my_parzen`

returns a probability for any $x$, also for $x\in \mbox{x_tst}$. No need for approximation, interpolation or similar techniques here!

For the following questions **it is assumed you run rand('seed', 42);** always before the splitting of the training set (for both A and C). Subsequent uses of the random generator (including

`crossval`

- Is it possible to estimate $p_{X|k}(x|k)$ outside of the range of the [min(x_trn), max(x_trn)] interval?
- a) Yes
- b) No
- c) Only if we have enough data

- What is the optimal value of $h$ for class A and class C rounded to two decimal points?
- What is the classification error (in
**percents**) of the Bayesian classifier rounded to two decimal points?

1. The correct answer is a). Just use the equation. Nothing is preventing evaluating the equation outside the [min(x_trn), max(x_trn)] interval. 2. class A: 307.92; class C: 238.68 3. 7.50

Adapt your code to accept two dimensional measurements

- x = (sum of pixel intensities in the
**left**half of the image) - (sum of pixel intensities in the**right**half of the image) - y = (sum of pixel intensities in the
**top**half of the image) - (sum of pixel intensities in the**bottom**half of the image)

The kernel function will now be a 2D normal distributions with diagonal covariance matrix `cov = sigma^2*eye(2)`

. Display the estimated distributions $p_{X|k}(x|k)$.

- [1] Parzen window
- [3] Parzen Windows Method (V. Hlaváč) in Czech
- [4] Cross-validation

courses/be5b33rpz/labs/05_parzen/start.txt · Last modified: 2022/10/23 23:17 by sochmjan