In this assignment you will work with data sets of emails that also contain metadata about those emails. Such a data set is usually called a corpus. In our case, the metadata states whether a particular email is spam, and/or whether a spam filter considers it spam.
You are given two data sets to work with; both come from the same source. download data
We shall use the following convention: our email corpus will be a directory of email files, which may also contain two special files:
a !truth.txt file, which lists the name of each email file together with its true nature (spam or not), one file per line, and
a !prediction.txt file, which has the same structure as !truth.txt but contains the spam filter's predictions (decisions) for the respective email files.
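To make the file convention concrete, here is a minimal sketch of how such a file might be read. The exact line format is an assumption, not something stated above: each line is taken to hold an email file name, then whitespace, then a label. The labels `OK` and `SPAM` and the function names `parse_classification` and `read_classification` are hypothetical and used only for illustration.

```python
from pathlib import Path


def parse_classification(text):
    """Parse the contents of a !truth.txt or !prediction.txt file.

    Assumed format (not confirmed by the assignment text): each non-empty
    line holds an email file name, whitespace, and a label.
    """
    labels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last run of whitespace: the label is the final token.
        # A line with a single token raises ValueError here.
        filename, label = line.rsplit(maxsplit=1)
        labels[filename] = label
    return labels


def read_classification(corpus_dir, name="!truth.txt"):
    """Read a classification file from a corpus directory into a dict."""
    path = Path(corpus_dir) / name
    return parse_classification(path.read_text(encoding="utf-8"))


# Hypothetical file contents; the labels OK/SPAM are an assumption.
example = "email01 OK\nemail02 SPAM\n"
print(parse_classification(example))
# {'email01': 'OK', 'email02': 'SPAM'}
```

Because both files share one structure, the same reader works for either; only the `name` argument changes.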
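Of course, these two files do not have to be present in the corpus directory:
A corpus given to a filter for classification does not need to contain !prediction.txt; the filter creates this file with the correct structure, containing its predictions.
A corpus that a filter should learn from must contain the !truth.txt file. (Otherwise you would not know which emails are spam…)
A corpus used to assess a filter must contain both !truth.txt and !prediction.txt. By comparing these files, we can tell how good the filter's predictions are.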
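The comparison of the two files can be sketched as follows. This is one simple measure (the fraction of emails on which the two files agree), not necessarily the quality measure the assignment uses. The dictionaries stand in for the parsed contents of !truth.txt and !prediction.txt; the labels `OK` and `SPAM` and the function name `agreement` are assumptions made for illustration.

```python
def agreement(truth, prediction):
    """Return the fraction of emails on which truth and prediction agree.

    Both arguments are dicts mapping an email file name to a label.
    Emails missing from `prediction` count as disagreements, which is a
    design choice of this sketch.
    """
    if not truth:
        raise ValueError("truth must contain at least one email")
    matches = sum(
        1 for name, label in truth.items() if prediction.get(name) == label
    )
    return matches / len(truth)


# Hypothetical data: the filter mislabels email02 only.
truth = {"email01": "OK", "email02": "SPAM", "email03": "SPAM", "email04": "OK"}
prediction = {"email01": "OK", "email02": "OK", "email03": "SPAM", "email04": "OK"}
print(agreement(truth, prediction))  # 3 of 4 labels match -> 0.75
```

A single agreement score treats both kinds of error alike; a filter that marks a legitimate email as spam may deserve a heavier penalty than one that lets a spam through, so counting the two error types separately is often more informative.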