====== Data format ======

During this assignment you will work with a set of emails, which may also contain meta-data. Such a data set is usually called a [[wp>Text_corpus|corpus]]. In our case, the meta-data may state whether each email is spam or not, and/or what the spam filter decided about it.

You are given two sets of data to work with; both come from the same source.

{{filelist>:courses:a4b99rph:cviceni:files:spam-data-12-s75-h25.zip&style=table&tableheader=1&tableshowdate=1&tableshowsize=1}}

So, our email corpus will be:

  * a folder in which every file is considered an email, with the exception of
    * the //!truth.txt// file, which lists, one email file per line, the name of the file and its true nature (spam or not), and
    * the //!prediction.txt// file, which has the same structure as //!truth.txt// and contains the spam filter's prediction for the respective email file.

Of course, these two files do not have to be present in the corpus directory:

  - The spam filter itself does not need either of them to work and decide. However, it should be able to create //!prediction.txt// containing its predictions.
  - If you create your own spam filter (or if you want to use a machine learning algorithm to build the filter), you will need a //training corpus//, i.e. a corpus containing the //!truth.txt// file. (Otherwise you would not know which emails are spam.)
  - If we want to evaluate the quality of a filter, we will need both files, //!truth.txt// and //!prediction.txt//. By comparing them, we can tell how good the filter's predictions are.
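
The corpus layout described above can be explored with a few lines of Python. The following is only a minimal sketch under stated assumptions, not a prescribed solution: it assumes that each line of //!truth.txt// and //!prediction.txt// holds a file name and a label separated by whitespace, and that file names contain no spaces. The function names ''read_classification'', ''email_files'' and ''agreement'' are hypothetical, and the fraction of matching labels is just one simple way to compare the two files.

```python
import os

# The two special files named in the text; everything else in the folder is an email.
SPECIAL_FILES = {"!truth.txt", "!prediction.txt"}


def read_classification(path):
    """Read a file such as !truth.txt into a dict {email file name: label}.

    Assumption: every useful line looks like "<file name> <label>",
    separated by whitespace. Other lines are skipped.
    """
    labels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # blank or malformed line
            name, label = parts
            labels[name] = label
    return labels


def email_files(corpus_dir):
    """Return the names of all email files in the corpus folder."""
    return [
        name
        for name in os.listdir(corpus_dir)
        if name not in SPECIAL_FILES
        and os.path.isfile(os.path.join(corpus_dir, name))
    ]


def agreement(truth, prediction):
    """Fraction of emails listed in truth whose predicted label matches.

    An email missing from prediction counts as a mismatch.
    """
    if not truth:
        return 0.0
    hits = sum(1 for name, label in truth.items() if prediction.get(name) == label)
    return hits / len(truth)
```

For a corpus folder that contains both special files, one could call ''read_classification'' on each of them and pass the two dictionaries to ''agreement''. The labels themselves are treated as opaque strings here, so the sketch does not depend on how spam and non-spam are actually marked in the files.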