Data format

In this assignment you will work with data sets of emails that also contain metadata about the emails. Such a data set is usually called a corpus. In our case, the metadata state whether a particular email is spam or not, and/or whether a spam filter thinks the email is spam or not.

You are given two data sets to work with; both come from the same source.

So, our email corpus will be:

  • a folder in which every file is considered an email, with the exception of
  • the !truth.txt file, which lists, one email file per line, the name of the file and its true nature (spam or not), and
  • the !prediction.txt file, which has the same structure as !truth.txt and contains the spam filter's prediction for the respective email file.
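
The corpus layout described above can be read with a few lines of Python. This is only a minimal sketch, not the required implementation. It assumes that each line of !truth.txt or !prediction.txt holds a file name and a label separated by whitespace, and that file names contain no spaces; the page does not fix either detail. The function names read_classification and email_files are hypothetical.

```python
from pathlib import Path

# File names taken from the corpus description.
TRUTH_FILE = "!truth.txt"
PREDICTION_FILE = "!prediction.txt"


def read_classification(path):
    """Return {email_file_name: label} read from a !truth.txt-style file.

    Assumption: each line is '<file name> <label>', whitespace-separated.
    """
    labels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            name, label = parts
            labels[name] = label
    return labels


def email_files(corpus_dir):
    """Return the sorted names of all email files in the corpus folder.

    Every regular file counts as an email except the two metadata files.
    """
    return sorted(
        p.name
        for p in Path(corpus_dir).iterdir()
        if p.is_file() and p.name not in (TRUTH_FILE, PREDICTION_FILE)
    )
```

Keeping the labels in a dictionary keyed by file name makes it easy to look up the label of any email returned by email_files.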

Of course, these two files do not have to be present in the corpus directory:

  1. The spam filter itself needs neither of them to work and decide. However, it should be able to create !prediction.txt containing its predictions.
  2. If you create your own spam filter (or if you want to use a machine learning algorithm to build the filter), you will need a training corpus, i.e. a corpus containing the !truth.txt file. (Otherwise you would not know which emails are spam.)
  3. If we want to evaluate filter quality, we need both files, !truth.txt and !prediction.txt. By comparing these files, we can tell how good the filter's predictions are.
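
Points 1 and 3 can be illustrated with a short sketch that works on dictionaries mapping email file names to labels. It is written under assumptions: the one-pair-per-line, space-separated output format is a guess at the file structure, the names write_classification and agreement are hypothetical, and the plain fraction of matching labels used here is just one simple way of comparing the two files, not necessarily the quality measure the assignment asks for.

```python
def write_classification(path, labels):
    """Write {email_file_name: label} to a !prediction.txt-style file.

    Assumption: one '<file name> <label>' pair per line.
    """
    with open(path, "w", encoding="utf-8") as f:
        for name, label in sorted(labels.items()):
            f.write(f"{name} {label}\n")


def agreement(truth, prediction):
    """Return the fraction of emails in `truth` whose prediction matches.

    An email with no prediction at all counts as a mismatch.
    """
    if not truth:
        raise ValueError("the truth dictionary is empty")
    hits = sum(
        1 for name, label in truth.items() if prediction.get(name) == label
    )
    return hits / len(truth)
```

For example, if the filter labels three of four emails the same way as !truth.txt does, agreement returns 0.75.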
courses/be5b33prg/homeworks/spam/data.1448377594.txt.gz · Last modified: 2015/11/24 16:06 by xposik