======Data format======
During this assignment you will work with data sets of emails that also contain metadata about the emails. Such a data set is usually called a [[wp>Text_corpus|corpus]]. In our case, the metadata state whether a particular email is spam and/or whether a spam filter considers it spam.

You are given two data sets to work with; both come from the same source.
<WRAP download>
{{filelist>:courses:a4b99rph:cviceni:files:spam-data-12-s75-h25.zip&style=table&tableheader=1&tableshowdate=1&tableshowsize=1}}
</WRAP>

We shall use the following **convention**: our email corpus will be
  * a folder, where every file contains a single email message in a text form, with the exception of
  * the ''!truth.txt'' file, which lists the names of the email files together with the true class of each email (spam or not), one email file per line, and
  * the ''!prediction.txt'' file, which has the same structure as ''!truth.txt'', but contains the spam filter predictions/decisions for the respective email message file.
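
The convention above can be illustrated by a minimal Python sketch that loads such a file into a dictionary. The function name ''read_classification_from_file'' is hypothetical, and the line layout (file name, whitespace, label) as well as the labels ''SPAM'' and ''OK'' are assumptions, because the exact format is not specified here; check the files in the provided archive before relying on it.

```python
def read_classification_from_file(path):
    """Return a dict mapping email file name -> label read from `path`.

    Assumed line layout (not confirmed by the text above):
        <file name><whitespace><label>
    """
    labels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Split off the last whitespace-separated token as the label,
            # so that a file name containing spaces stays in one piece.
            parts = line.strip().rsplit(maxsplit=1)
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            name, label = parts
            labels[name] = label
    return labels
```

Since ''!prediction.txt'' has the same structure as ''!truth.txt'', the same function could be used to read either file.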

Of course, these two files do not have to be present in the corpus directory:
  - The spam filter itself needs neither of them to classify emails. However, it should be able to create ''!prediction.txt'' with the correct structure, containing its predictions.
  - If you create your own spam filter (or if you want to try to use a machine learning algorithm to build the filter), you will need a //training corpus//, i.e. a corpus containing the ''!truth.txt'' file. (Otherwise you would not know which emails are spam.)
  - When we evaluate the quality of a filter, we need both files, ''!truth.txt'' and ''!prediction.txt''. By comparing them, we can tell how good the filter's predictions are.
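
The two roles described in this list, producing ''!prediction.txt'' and comparing it with ''!truth.txt'', might look as follows in Python. This is only a sketch under assumptions: the function names are hypothetical, the line layout (file name, space, label) and the labels ''SPAM'' and ''OK'' are assumed, the "filter" trivially assigns the same label to every email, and plain accuracy is just one simple way to compare the files, not necessarily the measure used in this course.

```python
import os

# The two metadata files named by the corpus convention; all other files are emails.
SPECIAL_FILES = {"!truth.txt", "!prediction.txt"}

def write_constant_predictions(corpus_dir, label="OK"):
    """Label every email in corpus_dir with the same label and save the result."""
    names = sorted(
        n for n in os.listdir(corpus_dir)
        if n not in SPECIAL_FILES and os.path.isfile(os.path.join(corpus_dir, n))
    )
    out_path = os.path.join(corpus_dir, "!prediction.txt")
    with open(out_path, "w", encoding="utf-8") as f:
        for name in names:
            f.write(f"{name} {label}\n")  # assumed layout: "<file name> <label>"
    return names

def compute_accuracy(truth, prediction):
    """Fraction of emails in `truth` whose predicted label equals the true label.

    Both arguments are dicts mapping file name -> label.
    An email with no prediction counts as a wrong prediction.
    """
    if not truth:
        raise ValueError("truth must contain at least one email")
    correct = sum(1 for name, label in truth.items()
                  if prediction.get(name) == label)
    return correct / len(truth)

# Made-up labels for illustration only.
truth = {"email1": "SPAM", "email2": "OK", "email3": "OK", "email4": "SPAM"}
prediction = {"email1": "SPAM", "email2": "SPAM", "email3": "OK", "email4": "SPAM"}
print(compute_accuracy(truth, prediction))  # 3 of 4 labels agree -> 0.75
```

A constant-label filter like this needs neither metadata file to run, which matches the first point above; evaluating it requires both.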