====== Exercise 2 ======

Program:
  * Revision: training and testing error, basis expansion
  * Linear regression and polynomial model
  * Nearest-neighbors regression

Downloads:
  * {{:courses:y33aui:cviceni:datafor1dregression.zip|Data to model}}

**You will likely not manage to finish the implementation of all the functions during the exercise in the lab. Finish them as homework.**

===== Revision =====

Be sure to understand the following concepts:
  * training and testing error, and their relation to model flexibility (lecture 1)
  * basis expansion (lecture 2)

===== Linear regression "theory" =====

Linear regression is used to model continuous dependent variables. The model has the form of a linear function.

==== 1D case ====

Assume we have 1D data: each object is described by a single feature ($x_i \in X$) and by the value of the dependent variable ($y_i \in Y$). The goal is to find the function (or, more precisely, the parameters $w_1$ and $w_0$ of the function)
$$ f(x) = w_1x + w_0 $$
such that the error
$$ MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i))^2 = \frac{1}{N} \sum_{i=1}^{N} (y_i - (w_1x_i + w_0))^2 $$
which the model makes on the training data is minimal.

In the case of linear regression with MSE minimization, there are explicit formulas to compute the weights //w// from the training data:
\[ w_1 = \frac{\sum_{i=1}^N (x_i - \bar X)(y_i - \bar Y)}{\sum_{i=1}^N (x_i - \bar X)^2} = \frac{s_{XY}}{s_X^2} \ \ \ \text{and}\ \ \ w_0 = \bar Y - w_1\bar X, \]
where $s_{XY}$ is the covariance of //X// and //Y//, and $s_{X}^2$ is the variance of //X//.

==== 1D case, homogeneous coordinates ====

If we introduce so-called homogeneous coordinates (the description of each object is enriched with one feature that always has the value 1), the description of the input variables will look like this:
\[ X = \left( \begin{array}{cccc} x_1 & x_2 & \ldots & x_N \\ 1 & 1 & \ldots & 1 \end{array} \right). \]
Arranging all the coefficients of the linear function into a vector $\mathbf{w}=(w_1, w_0)^T$, we can apply the linear function to all objects at once as
\[ Y = \mathbf{w}^T X. \]
The weight vector can then be computed as
\[ \mathbf{w} = (XX^T)^{-1}XY^T. \]

==== Multidimensional case ====

In //D//-dimensional space we can use the same formula. The weight vector will contain //D//+1 coefficients, and the description of 1 object will consist of //D//+1 variables (one of them always equal to 1).

===== Modeling data using linear regression =====

==== Linear regression in MATLAB ====

Learn how to compute the weight vector in MATLAB. Do not use the above-mentioned formula; see
  * ''help slash'' and
  * ''help regress''.
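For illustration, a minimal sketch of this computation, assuming the data are stored in row vectors ''x'' and ''y'':

<code matlab>
% A minimal sketch, assuming x and y are [1 x N] row vectors.
X = [x; ones(1, numel(x))];   % homogeneous coordinates, [2 x N]
w = X' \ y';                  % least-squares solution of X'*w = y'; w = [w1; w0]
% With the Statistics Toolbox, regress does the same job:
% w = regress(y', X');
</code>

The backslash operator solves the least-squares problem directly and is numerically preferable to explicitly inverting $XX^T$.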
==== Simple polynomial mapping ====

Using linear regression, we would also like to fit a polynomial model to the data. This can be accomplished by basis expansion, i.e. by first mapping the data into a higher-dimensional space.

Create a function with the following prototype:
<code matlab>
function xp = mapForPolynom(x, k)
</code>

^ Inputs | ''x'' | [1 x //N//] | 1D description of objects. Values of the input for all objects. |
^ | ''k'' | scalar | The highest power of ''x'' to be present in the output. |
^ Outputs | ''xp'' | [(//k//+1) x //N//] | Matrix of powers of ''x''. |

**Output ''xp'':** the first row will be equal to ''x''<sup>k</sup>, the second row will be equal to ''x''<sup>k-1</sup>, ..., the second-to-last row will be equal to ''x'', and the last row will be full of ones. (A possible sketch of this function appears below, after the function specifications.)

==== Training the polynomial model ====

We would like to use polynomial models of various degrees to model the data.

Create a function with the following prototype:
<code matlab>
function model = trainRegrPolynom(x, y, k)
</code>

^ Inputs | ''x'' | [1 x //N//] | 1D description of objects. Values of the independent variable for all training data. |
^ | ''y'' | [1 x //N//] | The vector of values of the dependent variable for all training data. |
^ | ''k'' | scalar | The degree of the polynomial we would like to use as the model. |
^ Outputs | ''model'' | [(//k//+1) x 1] | Vector of coefficients of the polynomial model. |

==== Applying the polynomial model on new data ====

After training a polynomial model, we would also like to apply the model to new data.

Create a function with the following prototype:
<code matlab>
function yp = predRegrPolynom(model, x)
</code>

^ Inputs | ''model'' | [(//k//+1) x 1] | Vector of coefficients of the polynomial model. |
^ | ''x'' | [1 x //N//] | 1D description of objects. Values of the input for all test objects. |
^ Outputs | ''yp'' | [1 x //N//] | The vector of predictions of the dependent variable for all test objects. |
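A minimal sketch of ''mapForPolynom'' (one possible solution, assuming ''x'' is a [1 x //N//] row vector):

<code matlab>
function xp = mapForPolynom(x, k)
% Map a 1D input into powers of x, highest power first.
% x  : [1 x N] row vector of input values
% k  : highest power of x in the output
% xp : [(k+1) x N] matrix; the last row is all ones
N = size(x, 2);
xp = zeros(k+1, N);
for p = k:-1:0
    xp(k-p+1, :) = x.^p;   % row 1 is x.^k, ..., last row is x.^0 = ones
end
end
</code>

As a quick check, ''mapForPolynom(1:3, 2)'' should return ''[1 4 9; 1 2 3; 1 1 1]''.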
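''trainRegrPolynom'' and ''predRegrPolynom'' can then be a few lines each. A possible sketch (both functions shown in one listing here; in MATLAB each would live in its own .m file):

<code matlab>
function model = trainRegrPolynom(x, y, k)
% Fit a degree-k polynomial by linear least squares on the expanded data.
xp = mapForPolynom(x, k);   % [(k+1) x N] basis expansion
model = xp' \ y';           % least-squares coefficients, [(k+1) x 1]
end

function yp = predRegrPolynom(model, x)
% Apply a trained polynomial model to new data.
k = numel(model) - 1;       % the degree is implied by the coefficient count
xp = mapForPolynom(x, k);
yp = model' * xp;           % [1 x N] predictions
end
</code>

Note that MATLAB's built-in ''polyfit''/''polyval'' implement the same idea (also with the coefficients ordered from the highest power down); implementing it via ''mapForPolynom'' yourself is the point of the exercise.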
==== Putting it all together ====

Create a script that will
  * load the {{:courses:y33aui:cviceni:datafor1dregression.zip|data}},
  * split them into a training and a testing set,
  * train a polynomial model of a chosen degree, and
  * compute the error on both the training and the testing data.

Once this script works, extend it so that it can
  * model the data with polynomials of increasing degree (say, from 0 to 10), and
  * plot the error on the training and testing data as a function of the degree of the polynomial.

===== Nearest-neighbors method =====

The k-nearest neighbors (kNN) method is a simple and universal modeling method. It creates its prediction in the following way:
  * for each object //x// in the test set,
    * it looks up the //k// most similar objects in the training set, and
    * it aggregates their values of the dependent variable to form its prediction.

Usually, the aggregation means
  * (weighted) averaging (or finding the median) in the case of regression,
  * (weighted) majority voting in the case of classification.

In the case of polynomial modeling, the model was a vector of weights. What is the model in the kNN method?

==== Helper function ====

We have to be able to determine the //k// nearest neighbors of points from one set in another set.

Create a function with the following prototype (a possible sketch appears at the end of this page):
<code matlab>
function iknn = findkNN(xtr, xtst, k)
</code>

^ Inputs | ''xtr'' | [//D// x //Ntr//] | Matrix with the training vectors. Prototype vectors. We search for the neighbors among them. |
^ | ''xtst'' | [//D// x //Ntst//] | Matrix with the testing vectors. Query vectors. We search for the neighbors for them. |
^ | ''k'' | scalar | The number of neighbors. |
^ Outputs | ''iknn'' | [//k// x //Ntst//] | Matrix of indices of the nearest neighbors. |

**Output ''iknn'':** for each testing vector, it will contain indices into the matrix of training vectors.
  * ''iknn(:, 3)'' will be a column vector of indices of the //k// nearest neighbors of the 3rd testing vector.
  * ''xtr(:, iknn(1,3))'' should be the nearest neighbor of the 3rd testing vector found among the training points.

A few hints:
  * Do you remember the ''distmat()'' function we created last week?
  * What does the function ''min'' return when called in the following way: ''[foo, imin] = min(dm, [], 1)''?
  * Can the distance matrix be modified somehow so that an additional call to ''min'' can be used to return the second, third, etc. lowest element?

==== Training the kNN model ====

Create a function with the following prototype:
<code matlab>
function model = trainRegrKNN(x, y, k)
</code>

^ Inputs | ''x'' | [//D// x //N//] | Values of the inputs for all objects. In this case, //D// = 1. |
^ | ''y'' | [1 x //N//] | The vector of values of the dependent variable for all objects. |
^ | ''k'' | scalar | The number of nearest neighbors we would like to use in the model. |
^ Outputs | ''model'' | struct | Training data ''x'', ''y'' and the number of neighbors ''k''. |

==== Applying the kNN model on new data ====

After training the kNN model, we would also like to apply the model to new data. (Sketches of both kNN functions also appear at the end of this page.)

Create a function with the following prototype:
<code matlab>
function yp = predRegrKNN(model, x)
</code>

^ Inputs | ''model'' | struct | Model for the kNN method. The output of the ''trainRegrKNN'' function. |
^ | ''x'' | [//D// x //N//] | Values of the inputs for all test objects. In this case, //D// = 1. |
^ Outputs | ''yp'' | [1 x //N//] | The vector of predictions of the dependent variable for all test objects. |

==== Putting it all together ====

Create a script that will
  * load the {{:courses:y33aui:cviceni:datafor1dregression.zip|data}},
  * split them into a training and a testing set,
  * train a kNN model with a chosen //k//, and
  * compute the error on both the training and the testing data.

Once this script works, extend it so that it can
  * model the data with the kNN method with increasing //k// (say, from 1 to 20), and
  * plot the error on the training and testing data as a function of the number of nearest neighbors //k//.
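For reference, a minimal sketch of ''findkNN''. It assumes the ''distmat(a, b)'' function from last week returning the [//Ntr// x //Ntst//] matrix of pairwise distances, and uses ''sort'' instead of the repeated-''min'' trick from the hints:

<code matlab>
function iknn = findkNN(xtr, xtst, k)
% Find the k nearest training vectors for each testing vector.
% Assumes distmat(xtr, xtst) returns the [Ntr x Ntst] distance matrix.
dm = distmat(xtr, xtst);
[~, order] = sort(dm, 1);   % sort each column by ascending distance
iknn = order(1:k, :);       % [k x Ntst] indices of the k nearest neighbors
end
</code>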
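''trainRegrKNN'' and ''predRegrKNN'' are then short. A possible sketch (both functions in one listing; plain unweighted averaging is used):

<code matlab>
function model = trainRegrKNN(x, y, k)
% kNN is a "lazy" method: training just stores the data and k.
model = struct('x', x, 'y', y, 'k', k);
end

function yp = predRegrKNN(model, x)
% Predict by averaging y over the k nearest training neighbors.
iknn = findkNN(model.x, x, model.k);   % [k x N] neighbor indices
yknn = model.y(iknn);                  % [k x N] their y values
yp = mean(yknn, 1);                    % [1 x N] unweighted average
end
</code>

This also answers the question above: in kNN, the "model" is essentially the training data themselves, plus the parameter //k//.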
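Finally, a sketch of the experiment script for the kNN part; the polynomial version is analogous, with the degree in place of //k//. The file and variable names (''datafor1dregression'', ''x'', ''y'') and the odd/even split are assumptions; adapt them to the actual contents of the downloaded archive:

<code matlab>
% Assumption: the unpacked data file provides row vectors x and y, [1 x N].
load datafor1dregression   % hypothetical file name; adapt as needed

itr  = 1:2:numel(x);       % odd indices -> training set
itst = 2:2:numel(x);       % even indices -> testing set
xtr  = x(itr);   ytr  = y(itr);
xtst = x(itst);  ytst = y(itst);

ks = 1:20;                 % values of k to try
errTr  = zeros(size(ks));
errTst = zeros(size(ks));
for i = 1:numel(ks)
    model = trainRegrKNN(xtr, ytr, ks(i));
    errTr(i)  = mean((ytr  - predRegrKNN(model, xtr )).^2);   % training MSE
    errTst(i) = mean((ytst - predRegrKNN(model, xtst)).^2);   % testing MSE
end

plot(ks, errTr, 'b.-', ks, errTst, 'r.-');
xlabel('k (number of nearest neighbors)');
ylabel('MSE');
legend('training error', 'testing error');
</code>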