======Data format======
In this assignment you will work with a set of emails, which may also carry metadata. Such a data set is usually called a [[wp>Text_corpus|corpus]]. In our case, the metadata for an email may state whether it is spam and/or what the spam filter decided about it.

You are given two data sets to work with; both come from the same source.
<WRAP download>
{{filelist>:courses:a4b99rph:cviceni:files:spam-data-12-s75-h25.zip&style=table&tableheader=1&tableshowdate=1&tableshowsize=1}}
</WRAP>

So, our email corpus will be:
  * a folder in which every file is considered an email, with the exception of
  * the //!truth.txt// file, which lists the name of each email file together with its true nature (spam or not), one file per line, and
  * the //!prediction.txt// file, which has the same structure as //!truth.txt// and contains the spam filter's prediction for the respective email file.
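
The corpus layout described above could be read roughly as follows. This is only a minimal sketch: the helper names ''email_files'' and ''read_classification'' are hypothetical, and the code assumes that each line of //!truth.txt// holds a file name and a label separated by whitespace. The exact separator and the label values (here ''OK'' and ''SPAM'' in the usage note) are not specified on this page, so check them against the provided data.

```python
import os


def email_files(corpus_dir):
    """Yield the names of email files in a corpus directory.

    Assumption: special files are exactly those whose name starts
    with '!', such as !truth.txt and !prediction.txt.
    """
    for name in sorted(os.listdir(corpus_dir)):
        if name.startswith("!"):
            continue  # skip metadata files
        if os.path.isfile(os.path.join(corpus_dir, name)):
            yield name


def read_classification(path):
    """Read a file such as !truth.txt into a dict {file name: label}.

    Assumption: one 'file_name label' pair per line, separated by
    whitespace; lines with fewer than two fields are ignored.
    """
    labels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                labels[parts[0]] = parts[1]
    return labels
```

Under these assumptions, ''read_classification'' applied to a //!truth.txt// containing the lines ''a.txt OK'' and ''b.txt SPAM'' would give a dictionary mapping ''a.txt'' to ''OK'' and ''b.txt'' to ''SPAM''.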

Of course, these two files do not have to be present in the corpus directory:
  - The spam filter itself needs neither of them to work and decide. However, it should be able to create //!prediction.txt// containing its predictions.
  - If you create your own spam filter (or if you want to use a machine learning algorithm to build the filter), you will need a //training corpus//, i.e. a corpus containing the //!truth.txt// file. (Otherwise you would not know which emails are spam.)
  - If we want to evaluate the quality of a filter, we need both files, //!truth.txt// and //!prediction.txt//. By comparing them, we can tell how good the filter's predictions are.
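
To illustrate the comparison in the last point, here is a small sketch that computes the fraction of emails for which the prediction agrees with the truth. It is only one possible measure and not necessarily the quality score used in this course. The function name ''agreement'' is hypothetical, and the two dictionaries are assumed to map file names to labels, as they might be loaded from //!truth.txt// and //!prediction.txt//; the labels ''OK'' and ''SPAM'' are illustrative.

```python
def agreement(truth, prediction):
    """Return the fraction of emails whose predicted label equals the true one.

    Both arguments are dicts {file name: label}. An email that is
    missing from `prediction` is counted as a wrong prediction.
    """
    if not truth:
        raise ValueError("the truth dictionary is empty")
    correct = sum(
        1 for name, label in truth.items() if prediction.get(name) == label
    )
    return correct / len(truth)


# Illustrative data only, not taken from the provided corpora.
truth = {"a": "OK", "b": "SPAM", "c": "SPAM", "d": "OK"}
prediction = {"a": "OK", "b": "OK", "c": "SPAM", "d": "OK"}
print(agreement(truth, prediction))  # prints 0.75
```

A plain fraction like this treats both kinds of mistakes equally; in practice, marking a legitimate email as spam is usually considered worse than letting a spam through, so a real evaluation may weight them differently.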