======Data format======

In this assignment you will work with data sets of emails that also contain meta-data about the emails. Such a data set is usually called a [[wp>Text_corpus|corpus]]. In our case, the meta-data state whether a particular email is spam or not, and/or whether a spam filter considers the email spam or not.

You are given two data sets to work with; both come from the same source.

{{filelist>:courses:a4b99rph:cviceni:files:spam-data-12-s75-h25.zip&style=table&tableheader=1&tableshowdate=1&tableshowsize=1}}

We shall use the following **convention**: our email corpus will be
  * a folder in which every file contains a single email message in text form, with the exception of
  * the ''!truth.txt'' file, which lists, one email file per line, the name of the file and its true nature (spam or not), and
  * the ''!prediction.txt'' file, which has the same structure as ''!truth.txt'' but contains the spam filter's predictions/decisions for the respective email files.

Of course, these two files do not have to be present in the corpus directory:
  - The spam filter itself needs neither of them to work and decide. However, it should be able to create ''!prediction.txt'' with the correct structure, containing its predictions.
  - If you create your own spam filter (or if you want to use a machine learning algorithm to build the filter), you will need a //training corpus//, i.e. a corpus containing the ''!truth.txt'' file. (Otherwise you would not know which emails are spam.)
  - When we evaluate the quality of a filter, we will need both files: ''!truth.txt'' and ''!prediction.txt''. By comparing them, we can tell how good the filter's predictions are.
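
The convention above can be illustrated with a minimal Python sketch. It rests on several assumptions that the text does not state: each line is taken to hold a file name and a label separated by whitespace, file names are taken to contain no spaces, and the labels ''SPAM'' and ''OK'' as well as the file names ''email01'' and ''email02'' are hypothetical placeholders. The helper names ''read_classification'', ''write_classification'' and ''count_agreements'' are made up for this illustration only.

```python
import os
import tempfile


def read_classification(path):
    """Read a '!truth.txt'-style file into a {file name: label} dict.

    Assumes one 'name label' pair per line, separated by whitespace.
    """
    labels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            name, label = parts
            labels[name] = label
    return labels


def write_classification(path, labels):
    """Write a {file name: label} dict in the same one-pair-per-line form."""
    with open(path, "w", encoding="utf-8") as f:
        for name, label in labels.items():
            f.write(f"{name} {label}\n")


def count_agreements(truth, prediction):
    """Count emails whose predicted label equals the true label."""
    return sum(1 for name, label in truth.items()
               if prediction.get(name) == label)


# Small round trip in a temporary folder standing in for a corpus directory.
with tempfile.TemporaryDirectory() as corpus_dir:
    truth_path = os.path.join(corpus_dir, "!truth.txt")
    write_classification(truth_path, {"email01": "SPAM", "email02": "OK"})
    truth = read_classification(truth_path)

# Hypothetical filter decisions for the same two emails.
prediction = {"email01": "SPAM", "email02": "SPAM"}
matches = count_agreements(truth, prediction)
```

Counting agreements is only the simplest possible comparison of the two files; how the quality of a filter is actually measured is not fixed by this sketch.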