\input{../../header.tex}
\usepackage[utf8]{inputenc}
\title{B(E)3M33UI --- Exercise ML04:\\Model evaluation and diagnostics.}
\author{Petr Po\v{s}\'{\i}k, Ji\v{r}\'{i} Spilka}
\usepackage{multirow}
\begin{document}
\maketitle
\vspace{-1em}
\noindent The learning goals of this exercise:
\begin{itemize}
\item learn about different performance evaluation metrics
\item evaluate the quality of models using different methods (holdout, cross-validation)
\item gain insights into models using diagnostic tools (validation and learning curves)
\end{itemize}
\section{Performance evaluation metrics}
We are already familiar with two different metrics: \textbf{mean square error} for the regression task and \textbf{zero-one error} for the \textbf{classification} task. In this exercise we focus on binary classification only, i.e. $y \in \{0, 1\}$. Detailed information about classification performance is provided by the so-called \href{https://en.wikipedia.org/wiki/Confusion_matrix}{\textbf{confusion matrix}}, which describes the relationship between the true class, $y$, and the class predicted by a model, $\hat{y}$.
\begin{table}[ht!]
\renewcommand{\arraystretch}{1.4}
\centering
\begin{tabular}{cc|cc|}
& \multicolumn{1}{c}{} & \multicolumn{2}{c}{\textbf{predicted class}} \\
& \multicolumn{1}{c}{} & \multicolumn{1}{c}{$\hat{y} = 0$} & \multicolumn{1}{c}{$\hat{y} = 1$} \\
\cline{3-4}
\multirow{2}{*}{\textbf{actual class}} & $y = 0$ & True Negative (TN) & False Positive (FP) \\
& $y = 1$ & False Negative (FN) & True Positive (TP)\\
% \cline{3-4}
\cline{3-4}
\end{tabular}
\end{table}
\noindent The \textbf{prediction accuracy} is then given by:
\begin{equation*}
ACC = \frac{TP + TN}{TP + FN + FP + TN}
\end{equation*}
\vspace{0.2em}
\noindent Further, two additional metrics are widely used, \textbf{true positive rate} (TPR) and \textbf{false positive rate} (FPR):
\begin{equation*}
TPR = \frac{TP}{TP + FN} \qquad FPR = \frac{FP}{FP + TN}
\end{equation*}
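The metrics above can be read directly off the confusion matrix. A minimal sketch, using made-up toy label vectors for illustration (note that \lstinline{sklearn} orders the matrix as $[[TN, FP], [FN, TP]]$):

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# sklearn returns the confusion matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

acc = (tp + tn) / (tp + fn + fp + tn)  # prediction accuracy
tpr = tp / (tp + fn)                   # true positive rate (sensitivity)
fpr = fp / (fp + tn)                   # false positive rate

print(acc, tpr, fpr)  # → 0.75 0.75 0.25
```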
\vspace{.2em}
\noindent In the previous exercises, we measured the model error using our own implementation. Now we will use the built-in facilities of scikit-learn.
\task{In \lstinline{ML04-1.py}, use an SVM classifier and evaluate its performance using the confusion matrix, accuracy, true positive rate, and false positive rate. Use the \href{http://scikit-learn.org/stable/modules/classes.html\#module-sklearn.metrics}{\lstinline{sklearn.metrics}} module and print the results. Note that we still use the training data for evaluation!}
\noindent\textbf{Hints:}
\begin{itemize}
\item Experiment with the number of features included in the model. Which model is better: one using only a few features, or one using all available features?
\end{itemize}
\noindent Clearly, the confusion matrix depends on a \textbf{single threshold} used to decide whether an example is positive or negative.
The \href{https://en.wikipedia.org/wiki/Receiver_operating_characteristic}{receiver operating characteristic (ROC)} is a graphical method to compare
binary classifiers over \textbf{different values of the threshold}: it shows how the performance changes with a \textbf{change of the threshold} used to divide the classes. The ROC curve plots the true positive rate (TPR) on the $y$ axis against the false positive rate (FPR) on the $x$ axis. An ideal classifier corresponds to the point $(0,1)$, i.e. $TPR = 1$, $FPR = 0$.
\task{Study the ROC and the documentation for \lstinline{sklearn.metrics.roc_curve}. Plot the ROC curve for the SVM model.}
\task{Experiment with different models (number of features, parameters). Setup several models and compare them using ROC curves.}
\noindent\textbf{Hints:}
\begin{itemize}
\item Study the function \lstinline{plot_roc()} in \lstinline{plotting.py}.
\item You can compare several models of different kinds, e.g. logistic regression, SVM, linear discriminant analysis, etc., or you can compare the same type of model with different
settings, e.g. SVM with different kernels (linear, polynomial, RBF), with different values of $C$, or with different values of the kernel parameter \lstinline{gamma}.
\item Can you train a ``perfect'' model with an area under the ROC curve of $AUC = 1$? Again, try to answer the questions: Which model is better? Which one would you use?
\end{itemize}
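Computing the points of the ROC curve can be sketched as follows. The toy data from \lstinline{make_classification} is a stand-in; in the exercise you would use the data loaded in \lstinline{ML04-1.py}:

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy data for illustration; replace with the exercise data
X, y = make_classification(n_samples=200, random_state=1)

model = SVC(kernel='rbf', C=1.0)
model.fit(X, y)

# roc_curve needs a continuous score, not hard 0/1 predictions;
# decision_function gives the signed distance from the hyperplane
scores = model.decision_function(X)
fpr, tpr, thresholds = metrics.roc_curve(y, scores)
auc = metrics.auc(fpr, tpr)
print('AUC =', auc)
```

The curve itself is then just a plot of \lstinline{fpr} against \lstinline{tpr}; the function \lstinline{plot_roc()} in \lstinline{plotting.py} does this for you.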
\section{Model evaluation}
In model evaluation we want to \textbf{estimate the predictive performance of a model} on new, previously unseen data (i.e. we want to estimate the performance the model will have when deployed in practice). So far we have evaluated models on the training data only, and we have observed that by tuning the model parameters we can obtain a classification without any errors. However, this tells us nothing about the predictive performance of the model.
\subsection{Training~/~test split (hold out)}
To get some insight into how well the model performs on new data, let's split the data into training and test sets. We then train (learn) the models on the training set and test them on both sets.
\task{In \lstinline{ML04.py}, fill in the code to split the data into training and testing data sets. Use the function \href{http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html}{\lstinline{model_selection.train_test_split}}.}
\noindent\textbf{Hints:}
\begin{itemize}
\item In the output of the script, check the shapes of the resulting NumPy arrays. Are they compatible?
\end{itemize}
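A minimal sketch of the split, again on toy data (the \lstinline{test_size} and \lstinline{stratify} settings are one reasonable choice, not a prescription):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data standing in for the exercise dataset
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Hold out 30 % of the data for testing; stratify keeps the
# class proportions the same in both parts
Xtr, Xtst, ytr, ytst = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print(Xtr.shape, Xtst.shape, ytr.shape, ytst.shape)
```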
\task{In \lstinline{ML04.py}, fill in the code to train the SVM on the training data, and compute the accuracy on both the training and the test set. How do they compare? Do you like what you see?}
\subsection{$k$-fold cross-validation (CV)}
A simple train/test split is used for an initial analysis, or for large datasets where data are abundant. When data are scarce, however, we resort to the $k$-fold cross-validation technique, in which the training dataset is randomly split into $k$ folds without replacement. Then $k-1$ folds are used for training and the remaining fold for testing. The procedure is repeated $k$ times.
\task{In \lstinline{ML04.py}, use the function \href{https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html\#sklearn.model_selection.cross_val_score}{\lstinline{sklearn.model_selection.cross_val_score}} to get the CV estimate of SVM performance. You should get a list of the accuracies, one for each fold.}
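The call can be sketched as follows (toy data again; \lstinline{cv=5} is just one common choice of $k$):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy data standing in for the exercise dataset
X, y = make_classification(n_samples=100, random_state=0)

# 5-fold CV: the model is fitted 5 times, each time tested on the
# held-out fold; the result is one accuracy per fold
scores = cross_val_score(SVC(), X, y, cv=5, scoring='accuracy')
print(scores)
print('mean accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))
```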
% \noindent\textbf{Hints:}
% \begin{itemize}
% \item Look at the documentation for \href{http://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics}{crossvalidation}.
% \end{itemize}
\task{Run the script several times. Do you see any fluctuations in the accuracy estimates for train/test split and cross-validation?}
\task{How do you judge the SVM model? Is it properly set?}
\section{Model tuning}
By tuning a model we want to \textbf{increase the predictive performance of the model} by selecting optimal hyper-parameters (tweaking the learning algorithm).
SVM models have several such parameters. We have already mentioned them briefly above, and you should also know them from the SVM lecture.
\subsection{Manual tuning}
\task{If you need to, check the different available \href{http://scikit-learn.org/stable/modules/svm.html\#svm-kernels}{kernels} and meaning of \lstinline{C} and \lstinline{gamma} parameters: \href{http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html}{\lstinline{sklearn.svm.SVC}}}
\task{Try to find a better setting for the \lstinline{C} and \lstinline{gamma} parameters of SVM by hand. What does ``a better setting'' actually mean?}
\subsection{Automatic tuning via grid search}
Clearly, guessing the best parameters manually is impractical. Let's try one of the automatic techniques, \textbf{grid search}, to find optimal SVM parameters.
\task{Learn about the \href{http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html}{\lstinline{GridSearchCV()}} function in the documentation.}
% When using the grid search facility, you may also need some other information about \href{http://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules}{how to construct your own scoring function} and use it e.g. in grid search.
\task{Use \lstinline{GridSearchCV()} to find near-optimal values of \lstinline{C} and \lstinline{gamma} for an SVM with the RBF kernel. \emph{Use only the training part of the data to search for the parameter values!}}
\task{Print out the scores of the final classifier on training and testing data.}
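One possible shape of the search, sketched on toy data (the parameter grid below is an illustrative logarithmic grid, not the one you should necessarily use):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Toy data standing in for the exercise dataset
X, y = make_classification(n_samples=200, random_state=0)
Xtr, Xtst, ytr, ytst = train_test_split(X, y, random_state=0)

# Logarithmic grids are the usual choice for C and gamma
param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1]}

gs = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
gs.fit(Xtr, ytr)  # the search sees only the training data

print('best parameters:', gs.best_params_)
print('train accuracy:', gs.score(Xtr, ytr))
print('test accuracy:', gs.score(Xtst, ytst))
```

After \lstinline{fit()}, the \lstinline{GridSearchCV} object itself acts as the final classifier refitted with the best parameters, so \lstinline{gs.score()} gives the accuracies asked for in the task.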
\noindent \textbf{Handling the datasets correctly with respect to the learning algorithms is a \emph{crucial} part of data analysis. Think about it thoroughly!}
\section{Model diagnostics}
For educational purposes we will use a synthetic Ripley's dataset here instead of the auto-mpg dataset.
The dataset is composed of two classes, each generated from a bimodal normal distribution (a mixture of two Gaussians) with the same variance of $0.04$.
Positive cases are generated from the means $\mu_1^+ = [0.4,0.7]$ and $\mu_2^+ = [-0.3,0.7]$, and negative cases
from $\mu_3^- = [-0.7,0.3]$ and $\mu_4^- = [0.3,0.3]$. The dataset is represented by features $X_1$ and $X_2$.
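The generating process described above can be sketched as follows; the function name and sample counts are illustrative, and \lstinline{ML04-2.py} loads its own copy of the data:

```python
import numpy as np

rng = np.random.RandomState(0)

def ripley_like(n_per_mode=50):
    """Sample data with the structure described above: each class is a
    mixture of two Gaussians with variance 0.04 (std 0.2)."""
    s = np.sqrt(0.04)                       # standard deviation 0.2
    pos_means = [(0.4, 0.7), (-0.3, 0.7)]   # positive-class modes
    neg_means = [(-0.7, 0.3), (0.3, 0.3)]   # negative-class modes
    X, y = [], []
    for mu in pos_means:
        X.append(rng.normal(mu, s, size=(n_per_mode, 2)))
        y.append(np.ones(n_per_mode))
    for mu in neg_means:
        X.append(rng.normal(mu, s, size=(n_per_mode, 2)))
        y.append(np.zeros(n_per_mode))
    return np.vstack(X), np.concatenate(y)

X, y = ripley_like()
print(X.shape, y.shape)  # → (200, 2) (200,)
```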
\task{Run the script \lstinline{ML04-2.py} to get an idea of what the dataset looks like.}
\subsection{Learning curves}
The learning curve is a technique that helps us diagnose whether a learning algorithm suffers
from underfitting (high bias) or overfitting (high variance). Furthermore, it can be used to investigate
whether obtaining more data samples would help the given algorithm reach a better performance.
\task{The script \lstinline{ML04-2.py} only plots a learning curve; its computation is not implemented yet.
Implement the function \lstinline{compute_learning_curve()} that takes the following arguments:
}
\begin{itemize}
\item the selected \lstinline{model} and \lstinline{tr_sizes} (the sizes of the training data used to train the model)
\item \lstinline{Xtr, ytr, Xtst, ytst}: training/test data and labels
\item Output 1: \lstinline{train_errors}, an array of training errors, one entry per training-set size
\item Output 2: \lstinline{test_errors}, an array of test errors, one entry per training-set size
\end{itemize}
\noindent\textbf{Hints:}
\begin{itemize}
\item To simulate the increasing training dataset size, the function uses an increasingly larger part of the training data.
\item To compute the error, you can use \lstinline{1 - metrics.accuracy_score()}.
\end{itemize}
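One possible implementation following the hints above (a sketch, not the only correct solution; the toy data at the bottom is only a smoke test):

```python
import numpy as np
from sklearn import metrics

def compute_learning_curve(model, tr_sizes, Xtr, ytr, Xtst, ytst):
    """For each training size n, fit the model on the first n training
    examples and record the train and test errors."""
    train_errors, test_errors = [], []
    for n in tr_sizes:
        model.fit(Xtr[:n], ytr[:n])
        train_errors.append(
            1 - metrics.accuracy_score(ytr[:n], model.predict(Xtr[:n])))
        test_errors.append(
            1 - metrics.accuracy_score(ytst, model.predict(Xtst)))
    return np.array(train_errors), np.array(test_errors)

# Quick smoke test on toy data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
Xtr, Xtst, ytr, ytst = train_test_split(X, y, random_state=0)
tr_err, tst_err = compute_learning_curve(
    SVC(), [20, 50, 100], Xtr, ytr, Xtst, ytst)
```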
\task{Display the learning curve several times; the computation is a stochastic process. Try to
display learning curves for different classifiers, e.g. use an SVM with $C=1$ and change the parameter \lstinline{gamma}. Can you provide examples of an underfitting, an overfitting, and an optimal classifier?
For which classifier would it be helpful to get more data?}
\subsection{Validation curves}
Validation curves are closely related to learning curves, but instead of plotting the error as a function of training-set size, the error is plotted as a function
of a model parameter (e.g. \lstinline{gamma} for SVM). The validation curve thus provides information about underfitting and overfitting with respect to the model parameters.
\task{Implement computation of validation curve (\lstinline{compute_validation_curve()}) that takes the following arguments:
}
\begin{itemize}
\item the selected \lstinline{model}, the parameter name \lstinline{param_name}, and the parameter values \lstinline{param_range}
\item \lstinline{Xtr, ytr, Xtst, ytst}: training/test data and labels
\item Output 1: \lstinline{train_errors}, an array of training errors, one entry per parameter value
\item Output 2: \lstinline{test_errors}, an array of test errors, one entry per parameter value
\end{itemize}
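The structure mirrors the learning curve: instead of growing the training set, we refit the model once per parameter value. A sketch under the same assumptions (toy data as a smoke test, \lstinline{set_params} used to change the hyper-parameter):

```python
import numpy as np
from sklearn import metrics

def compute_validation_curve(model, param_name, param_range,
                             Xtr, ytr, Xtst, ytst):
    """Refit the model for each value of the chosen hyper-parameter
    and record the train and test errors."""
    train_errors, test_errors = [], []
    for value in param_range:
        model.set_params(**{param_name: value})
        model.fit(Xtr, ytr)
        train_errors.append(
            1 - metrics.accuracy_score(ytr, model.predict(Xtr)))
        test_errors.append(
            1 - metrics.accuracy_score(ytst, model.predict(Xtst)))
    return np.array(train_errors), np.array(test_errors)

# Quick smoke test on toy data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
Xtr, Xtst, ytr, ytst = train_test_split(X, y, random_state=0)
gammas = [0.001, 0.01, 0.1, 1, 10]
tr_err, tst_err = compute_validation_curve(
    SVC(kernel='rbf'), 'gamma', gammas, Xtr, ytr, Xtst, ytst)
```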
\task{Play with the validation curves, e.g. use an SVM and vary the parameter \lstinline{gamma}. Try to interpret the results and identify the regions of underfitting, overfitting, and optimal performance.}
\section{Have fun!}
\textbf{Complete the exercise as a homework, ask questions on the forum, and upload the solution via the Upload system (BRUTE)!}
\end{document}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: