Search
As in the previous lab, we are again interested in probability density $p_{X|k}(x|k)$ estimation. Sometimes, however, one is challenged with the problem of estimating the probability density function without even knowing its parametric form (cf MLE labs). In such cases, non-parametric estimation using Parzen window method [1] can be applied.
The kernel density estimation $\hat{f}_h(x)$ of an unknown distribution $f(x)$ given a set $\{x_1, \ldots, x_n\}$ of independent identically distributed (i.i.d) measurements is defined as $$ \hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^n K_h(x - x_i) $$ where $K_h$ is the kernel (a symmetric function that integrates to one) and $h$ is the kernel size (bandwidth). Different kernel functions can be used (see [1]), but we will use $K_h(x) \equiv N(0, h^2)$, i.e. the normal distribution with mean 0 and standard deviation $h$.
Please, note that the kernel parameter $h$ does not make the method “parametric”. The parameter defines smoothness of the estimate, not the form of the estimated distribution. Nevertheless, the choice of $h$ still influences the quality of the estimate as you will see below. Using cross-validation [4] we will find such value of $h$ that ensures the estimate generalises well to previously unseen data.
Finally, using the above estimates of the probability density functions $p_{X|k}(x|k)$, we will once again build a Bayesian classifier for two letters and test it on an independent test set.
General information for Python development.
To fulfill this assignment, you need to submit these files (all packed in one .zip file) into the upload system:
.zip
parzen.ipynb
parzen.py
my_parzen
compute_Lh
classify_bayes_parzen
parzen_estimates.png
optimal_h_classA.png
optimal_h_classC.png
parzen_classif.png
Use template of the assignment. When preparing a zip file for the upload system, do not include any directories, the files have to be in the zip file root.
In this part we will implement the Parzen window density estimation method (see the formula above). For measurements, use the function compute_measurement_lr_cont. The data loaded in the parzen.ipynb template contain training data trn and test data tst - images and their correct labels. For both, the label 0 denotes an image of letter 'A' and label 1 an image of letter 'C'.
compute_measurement_lr_cont
trn
tst
Do the following:
xA
xC
p = my_parzen(x, trn_x, h)
trn_x
scipy.stats.norm.pdf()
p = my_parzen(np.array([1, 2, 3]), np.array([-1, 0, 2, 1.5, 0]), 1.0) print(p) # -> [0.22639369 0.17268428 0.07609717]
subfigure
my_parzen(x, xA, h)
min(xA)
max(xA)
As we have just seen, different values of $h$ produce different estimates of $p_{X|k}(x|A)$. Obviously, for $h=100$ the estimate is influenced too much by our limited data sample and it will not generalise well to unseen examples. For $h=2000$, the estimate is clearly oversmoothed. So, what is the best value of $h$?
To assess the ability of the estimate to generalise well to unseen examples we will use 10-fold cross-validation [4]. We will split the training set ten times (10-fold) into two sets $X^i_{\mbox{trn}}$ and $X^i_{\mbox{tst}}$, $i=1,\ldots,10$ and use the set $X^i_{\mbox{trn}}$ for distribution estimation and the set $X^i_{\mbox{tst}}$ for validation of the estimate. We will use the log-likelihood of the validation set given the bandwidth $h$, $L^i(h) = \sum_{x\in X^i_{\mbox{tst}}} log(p_{X|k}(x|A))$, averaged over ten splits $L(h)=1/10 \sum_{i=1}^{10} L^i(h)$ as a measure of the estimate quality. The value of $h$ which maximises $L(h)$ is the one which produces an estimate which generalises the best. See [4] for further description of the cross-validation technique.
crossval
Lh = compute_Lh(itrn, itst, x, h)
itrn{i}
itst{i}
scipy.optimize.fminbound
labels = classify_bayes_parzen(x_test, xA, xC, pA, pC, h_bestA, h_bestC)
x_test
For the following questions it is assumed you run rand('seed', 42); always before the splitting of the training set (for both A and C). Subsequent uses of the random generator (including crossval function) will be unambiguously defined.
rand('seed', 42);
Solution
1. The correct answer is a). Just use the equation. Nothing is preventing evaluating the equation outside the [min(x_trn), max(x_trn)] interval. 2. class A: 307.92; class C: 238.68 3. 7.50
Adapt your code to accept two dimensional measurements
The kernel function will now be a 2D normal distributions with diagonal covariance matrix cov = sigma^2*eye(2). Display the estimated distributions $p_{X|k}(x|k)$.
cov = sigma^2*eye(2)