Data format

In this assignment you will work with a set of emails, which may also carry metadata. Such a data set is usually called a corpus. In our case, the metadata may state whether each email is spam and/or what the spam filter decided about it.

You are given two data sets to work with; both come from the same source.

So, our email corpus will be:

Of course, the two files, !truth.txt and !prediction.txt, do not have to be present in the corpus directory:

  1. The spam filter itself does not need either of them in order to work and make decisions. However, it should be able to create !prediction.txt containing its predictions.
  2. If you create your own spam filter (or want to use a machine learning algorithm to build the filter), you will need a training corpus, i.e. a corpus containing the !truth.txt file. (Otherwise you would not know which emails are spam…)
  3. If we want to evaluate filter quality, we will need both files, !truth.txt and !prediction.txt. By comparing them, we can tell how good the filter's predictions are.
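
The comparison in point 3 could be sketched roughly as follows. This is only an illustration under assumptions that the text above does not state: it assumes each line of !truth.txt and !prediction.txt has the form "email_filename LABEL" separated by whitespace, and that the labels are plain strings such as SPAM and OK. The function names read_classification and accuracy are hypothetical, and plain accuracy is just one possible quality measure.

```python
import os


def read_classification(path):
    """Read a classification file into a dict {email_filename: label}.

    ASSUMPTION: each line looks like 'email_filename LABEL'.
    The real file format is not specified in this section.
    """
    labels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:  # skip blank or malformed lines
                name, label = parts
                labels[name] = label
    return labels


def accuracy(corpus_dir):
    """Fraction of emails listed in !truth.txt that were predicted correctly.

    An email missing from !prediction.txt counts as a wrong prediction.
    """
    truth = read_classification(os.path.join(corpus_dir, "!truth.txt"))
    pred = read_classification(os.path.join(corpus_dir, "!prediction.txt"))
    if not truth:
        raise ValueError("!truth.txt contains no usable lines")
    correct = sum(1 for name, label in truth.items() if pred.get(name) == label)
    return correct / len(truth)
```

Calling accuracy("path/to/corpus") on a corpus directory that contains both files would then return a number between 0 and 1; a corpus without !truth.txt cannot be evaluated this way.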