====== Exercise 2 ======
Program:
* Revision: Training and testing error, basis expansion
* Linear regression and polynomial model
* Nearest neighbors regression
Downloads:
* {{:courses:y33aui:cviceni:datafor1dregression.zip|Data to model}}
**Most likely, you will not manage to finish the implementation of all the functions during the exercise in the lab. Finish them as homework.**
===== Revision =====
Be sure to understand the following concepts:
* training and testing error, their relation to model flexibility (lecture 1)
* basis expansion (lecture 2)
===== Linear regression "theory" =====
Linear regression is used to model continuous dependent variables. The model has the form of a linear function.
==== 1D case ====
Assume we have 1D data. Each object is described by a single feature ($x_i \in X$) and by the value of the dependent variable ($y_i \in Y$). The goal is to find the function (or, more precisely, the parameters $w_1$ and $w_0$ of the function)
$$ f(x) = w_1x + w_0 $$
so that the error
$$ MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i))^2 = \frac{1}{N} \sum_{i=1}^{N} (y_i - (w_1x_i + w_0))^2 $$
which the model makes on the training data is minimal.
In case of linear regression and MSE minimization, there are explicit formulas to compute the weights //w// from training data:
\[ w_1 = \frac{\sum_{i=1}^N (x_i - \bar X)(y_i - \bar Y)}{\sum_{i=1}^N (x_i - \bar X)^2} = \frac{s_{XY}}{s_X^2} \quad\text{and}\quad w_0 = \bar Y - w_1\bar X, \]
where $s_{XY}$ is the covariance of //X// and //Y// and $s_{X}^2$ is the variance of //X//.
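The explicit formulas can be checked with a short sketch (shown here in NumPy purely for illustration; the lab itself uses MATLAB, and the function name is hypothetical):

```python
import numpy as np

# Illustrative NumPy sketch: fit f(x) = w1*x + w0 using the explicit
# covariance/variance formulas above.
def fit_line(x, y):
    xbar, ybar = x.mean(), y.mean()
    w1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    w0 = ybar - w1 * xbar
    return w1, w0

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0            # data lie exactly on a line,
w1, w0 = fit_line(x, y)      # so the fit should recover w1 = 2, w0 = 1
```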
==== 1D case, homogeneous coordinates ====
If we introduce the so-called homogeneous coordinates (the description of each object is enriched with one extra feature that always has the value 1), the matrix of input variables will look like this:
\[ X = \left(
\begin{array}{cccc}
x_1 & x_2 & \ldots & x_N \\
1 & 1 & \ldots & 1
\end{array}
\right).
\]
Arranging all the coefficients of the linear function into a vector $\mathbf{w}=(w_1, w_0)^T$, we can apply the linear function to all objects at once as
\[ Y = \mathbf{w}^T X. \]
The weight vector can then be computed as
\[ \mathbf{w} = (XX^T)^{-1}XY^T \]
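The same fit via homogeneous coordinates and the formula above, again as an illustrative NumPy sketch (in practice one solves the linear system rather than forming the inverse explicitly):

```python
import numpy as np

# NumPy sketch of w = (X X^T)^{-1} X Y^T with homogeneous coordinates.
x = np.array([0.0, 1.0, 2.0, 3.0])
Y = 2.0 * x + 1.0                     # row vector of targets
X = np.vstack([x, np.ones_like(x)])   # 2 x N: data row plus a row of ones
w = np.linalg.solve(X @ X.T, X @ Y)   # solve the system instead of inverting
# w[0] is the slope w1 (here 2), w[1] the intercept w0 (here 1)
```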
==== Multidimensional case ====
In //D//-dimensional space we can use the same formula. The weight vector will contain //D//+1 coefficients, and the description of each object will consist of //D//+1 variables (one of them always equal to 1).
===== Modeling data using linear regression =====
==== Linear regression in MATLAB ====
Learn how to compute the weight vector in MATLAB. Do not use the above-mentioned formula directly; see
* ''help slash'' and
* ''help regress''.
==== Simple polynomial mapping ====
Using linear regression, we would also like to fit a polynomial model to the data. This can be accomplished by basis expansion, i.e., by first mapping the data into a higher-dimensional space.
Create a function with the following prototype:
function xp = mapForPolynom(x, k)
^ Inputs | ''x'' | [1 x //N//] | 1D description of objects. Values of the input for all objects. |
^ | ''k'' | scalar | The highest power of ''x'' to be present in the output |
^ Outputs | ''xp'' | [//k+1// x //N//] | Matrix of powers of //x//.|
**Output ''xp'':** the first row will be equal to ''x''<sup>k</sup>, the second row will be equal to ''x''<sup>k-1</sup>, ..., the second-to-last row will be equal to ''x'', and the last row will be full of ones.
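The mapping can be sketched as follows (NumPy for illustration only; ''map_for_polynom'' is a hypothetical counterpart of the MATLAB function you should write):

```python
import numpy as np

# Illustrative sketch of mapForPolynom: rows x^k, x^(k-1), ..., x, ones.
def map_for_polynom(x, k):
    return np.vstack([x ** p for p in range(k, -1, -1)])

xp = map_for_polynom(np.array([1.0, 2.0]), 2)
# xp = [[1, 4],     row of x^2
#       [1, 2],     row of x^1
#       [1, 1]]     row of ones
```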
==== Training the polynomial model ====
We would like to use polynomial models of various degrees to model the data.
Create a function with the following prototype:
function model = trainRegrPolynom(x, y, k)
^ Inputs | ''x'' | [1 x //N//] | 1D description of objects. Values of the independent variable for all training data. |
^ | ''y'' | [1 x //N//] | The vector of values of the dependent variable for all training data. |
^ | ''k'' | scalar | The degree of the polynomial we would like to use as the model. |
^ Outputs | ''model'' | [(//k+1//) x 1] | Vector of coefficients of the polynomial model. |
==== Applying the polynomial model on new data ====
After training a polynomial model, we would also like to apply the model on new data.
Create a function with the following prototype:
function yp = predRegrPolynom(model, x)
^ Inputs | ''model'' | [(//k+1//) x 1] | Vector of coefficients of the linear model. |
^ | ''x'' | [1 x //N//] | 1D description of objects. Values of the input for all test objects. |
^ Outputs | ''yp'' | [1 x //N//] | The vector of predictions of the dependent variable for all test objects. |
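The training/prediction pair can be sketched like this (NumPy for illustration; the function names mirror the MATLAB prototypes but are hypothetical, and least squares plays the role of MATLAB's backslash):

```python
import numpy as np

# Basis expansion: rows x^k, x^(k-1), ..., x, ones.
def map_for_polynom(x, k):
    return np.vstack([x ** p for p in range(k, -1, -1)])

# Sketch of trainRegrPolynom: least-squares fit on the expanded features.
def train_regr_polynom(x, y, k):
    Xp = map_for_polynom(x, k)
    w, *_ = np.linalg.lstsq(Xp.T, y, rcond=None)
    return w

# Sketch of predRegrPolynom: apply the learned weights to new, expanded data.
def pred_regr_polynom(model, x):
    k = len(model) - 1
    return model @ map_for_polynom(x, k)

x = np.array([-1.0, 0.0, 1.0, 2.0])
y = x ** 2                                       # data from an exact quadratic
model = train_regr_polynom(x, y, 2)
yp = pred_regr_polynom(model, np.array([3.0]))   # should be close to 9
```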
==== Putting it all together ====
Create a script that will
* load the {{:courses:y33aui:cviceni:datafor1dregression.zip|data}},
* split them into training and testing set,
* train a polynomial model of chosen degree, and
* compute error for both training and testing data.
Once this script works correctly, extend it so that it can
* model the data with polynomials of increasing degree (say, from 0 to 10), and
* plot the errors on training and testing data as a function of the polynomial degree.
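A skeleton of such a script might look as follows (NumPy for illustration; synthetic data stand in for the downloadable zip, and plotting is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for the downloaded data (the real script loads the zip).
x = np.linspace(-1, 1, 40)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.size)

# Split into training and testing halves.
xtr, xtst = x[::2], x[1::2]
ytr, ytst = y[::2], y[1::2]

def fit_poly(xtr, ytr, k):
    Xp = np.vstack([xtr ** p for p in range(k, -1, -1)])
    w, *_ = np.linalg.lstsq(Xp.T, ytr, rcond=None)
    return w

def mse(w, x, y):
    k = len(w) - 1
    f = w @ np.vstack([x ** p for p in range(k, -1, -1)])
    return np.mean((y - f) ** 2)

# (degree, training error, testing error) for degrees 0..10;
# the training error can only decrease with degree, the test error
# eventually rises as the model starts to overfit.
errs = [(k,
         mse(fit_poly(xtr, ytr, k), xtr, ytr),
         mse(fit_poly(xtr, ytr, k), xtst, ytst))
        for k in range(0, 11)]
```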
===== Nearest-neighbors method =====
The k-nearest neighbors (kNN) method is a simple and universal modeling method. It creates its prediction in the following way:
* for each object //x// in the test set
* it looks up //k// most similar objects in the training set and
* it aggregates those neighbors' values of the dependent variable to form its prediction. Usually, the aggregation means
* (weighted) averaging (or finding the median) in case of regression modeling,
* (weighted) majority voting in case of classification.
In the case of polynomial modeling, the model was a vector of weights. What is the model in the kNN method?
==== Helper function ====
We have to be able to determine the //k// nearest neighbors of points from one set in another set.
Create a function with the following prototype:
function iknn = findkNN(xtr, xtst, k)
^ Inputs | ''xtr'' | [//D// x //Ntr//] | Matrix with the training vectors. Prototype vectors. We search the neighbors among them. |
^ | ''xtst'' | [//D// x //Ntst//] | Matrix with the testing vectors. Query vectors. We search the neighbors for them. |
^ | ''k'' | scalar | The number of neighbors. |
^ Outputs | ''iknn'' | [//k// x //Ntst//] | Matrix of indices of the nearest neighbors. |
**Output ''iknn'':** for each testing vector, it will contain indices into the matrix of training vectors.
* ''iknn(:, 3)'' will be a column vector of indices of //k// nearest neighbors of the 3rd testing vector.
* ''xtr(:, iknn(1,3) )'' should be the nearest neighbor of the 3rd testing vector found in the training points.
A few hints:
* Do you remember ''distmat()'' function we created last week?
* What does function ''min'' return when called in the following way: ''[foo, imin] = min(dm, [], 1)''?
* Can the distance matrix be modified somehow so that additional call to ''min'' can be used to return the second, third, etc. lowest element?
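The neighbor search can be sketched as follows (NumPy for illustration; instead of the repeated-''min'' trick hinted at above, this sketch sorts each column of the distance matrix, which gives the same indices):

```python
import numpy as np

# Illustrative sketch of findkNN: indices of the k nearest training columns
# for every testing column, using squared Euclidean distances.
def find_knn(xtr, xtst, k):
    # dm[i, j] = ||xtr[:, i] - xtst[:, j]||^2, shape Ntr x Ntst
    dm = ((xtr[:, :, None] - xtst[:, None, :]) ** 2).sum(axis=0)
    # Sort along the training axis and keep the first k rows of indices.
    return np.argsort(dm, axis=0)[:k, :]

xtr = np.array([[0.0, 1.0, 5.0]])   # 1 x 3 training points
xtst = np.array([[0.9]])            # 1 x 1 query point
iknn = find_knn(xtr, xtst, 2)       # nearest is index 1, then index 0
```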
==== Training the kNN model ====
Create a function with the following prototype:
function model = trainRegrKNN(x, y, k)
^ Inputs | ''x'' | [D x //N//] | Values of the inputs for all objects. In this case, D = 1. |
^ | ''y'' | [1 x //N//] | The vector of values of the dependent variable for all objects. |
^ | ''k'' | scalar | The number of nearest neighbors we would like to use as the model. |
^ Outputs | ''model'' | struct | Training data ''x'',''y'' and the number of neighbors ''k''. |
==== Applying the kNN model on new data ====
After training the kNN model, we would also like to apply the model on new data.
Create a function with the following prototype:
function yp = predRegrKNN(model, x)
^ Inputs | ''model'' | struct | Model for the kNN method. The output of the ''trainRegrKNN'' function. |
^ | ''x'' | [D x //N//] | Values of the inputs for all test objects. In this case, D = 1. |
^ Outputs | ''yp'' | [1 x //N//] | The vector of predictions of the dependent variable for all test objects. |
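The pair of functions can be sketched like this (NumPy for illustration; a dictionary stands in for the MATLAB struct, and the function names mirror the prototypes but are hypothetical):

```python
import numpy as np

# Sketch of trainRegrKNN: the "model" is just the stored training data and k.
def train_regr_knn(x, y, k):
    return {"x": x, "y": y, "k": k}

# Sketch of predRegrKNN: find the k nearest training points of each query
# and average their dependent-variable values.
def pred_regr_knn(model, x):
    xtr, ytr, k = model["x"], model["y"], model["k"]
    dm = ((xtr[:, :, None] - x[:, None, :]) ** 2).sum(axis=0)
    iknn = np.argsort(dm, axis=0)[:k, :]
    return ytr[iknn].mean(axis=0)

model = train_regr_knn(np.array([[0.0, 1.0, 2.0]]), np.array([0.0, 1.0, 2.0]), 2)
yp = pred_regr_knn(model, np.array([[0.9]]))   # neighbors 1 and 0 -> (1 + 0)/2
```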
==== Putting it all together ====
Create a script that will
* load the {{:courses:y33aui:cviceni:datafor1dregression.zip|data}},
* split them into training and testing set,
* train a kNN model with chosen //k//, and
* compute error for both training and testing data.
Once this script works correctly, extend it so that it can
* model the data with the kNN method with increasing //k// (say, from 1 to 20), and
* plot the errors on training and testing data as a function of the number of nearest neighbors //k//.
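As with the polynomial script, a skeleton might look like this (NumPy for illustration; synthetic data stand in for the downloadable zip, and plotting is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for the downloaded data (the real script loads the zip).
x = np.linspace(-1, 1, 40)[None, :]   # 1 x N, matching the D x N layout
y = np.sin(3 * x[0]) + 0.1 * rng.standard_normal(x.shape[1])

xtr, xtst = x[:, ::2], x[:, 1::2]
ytr, ytst = y[::2], y[1::2]

def pred_knn(xtr, ytr, xq, k):
    dm = ((xtr[:, :, None] - xq[:, None, :]) ** 2).sum(axis=0)
    iknn = np.argsort(dm, axis=0)[:k, :]
    return ytr[iknn].mean(axis=0)

# (k, training error, testing error) for k = 1..20; for k = 1 the training
# error is exactly zero, since each training point is its own nearest neighbor.
errs = [(k,
         np.mean((ytr - pred_knn(xtr, ytr, xtr, k)) ** 2),
         np.mean((ytst - pred_knn(xtr, ytr, xtst, k)) ** 2))
        for k in range(1, 21)]
```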