Differences

This shows you the differences between two versions of the page.

--- courses:be5b33prg:homeworks:spam:data [2015/11/24 16:04]
xposik
+++ courses:be5b33prg:homeworks:spam:data [2015/11/24 16:12]
xposik
@@ Line 1: / Line 1: @@
 ======Data format======
-During this assignment you will work with data sets of emails, which will also contain meta-data about the emails. Such a set of data is usually called [[wp>Text_corpus|corpus]]. In our case, the meta-data for our emails may contain the information whether it is a spam or not and/or what the decision of the spam filter is.
+During this assignment you will work with data sets of emails, which will also contain meta-data about the emails. Such a set of data is usually called a [[wp>Text_corpus|corpus]]. In our case, the meta-data will indicate whether a particular email is spam or not, and/or whether a spam filter considers the email to be spam.
 You are given two sets of data to work with; both come from the same source.
@@ Line 7: / Line 7: @@
 </WRAP>
-So, our email corpus will be:
+We shall use the following **convention**: our email corpus will be
-  * a folder, where every file is considered an email with the exception of
+  * a folder, where every file contains a single email message in text form, with the exception of
-  * the //!truth.txt// file, which contains a name of a file with an email and an information about its true nature (spam or not), one file per line, and
+  * the ''!truth.txt'' file, which contains the name of each email file together with information about its true nature (spam or not), one file per line, and
-  * the //!prediction.txt// file, which has the same structure as //!truth.txt// and contains the spam filter prediction for the respective email message file.
+  * the ''!prediction.txt'' file, which has the same structure as ''!truth.txt'', but contains the spam filter's predictions/decisions for the respective email message files.
 Of course, these two files do not have to be present in the corpus directory:
-  - Spam filter itself does not need either of them to work and decide. However, it should be able to create //!prediction.txt// containing its predictions.
+  - The spam filter itself needs neither of them to work and make its decisions. However, the spam filter should be able to create ''!prediction.txt'' with the correct structure, containing its predictions.
-  - If you create your own spam filter (or if you want to try to use a machine learning algorithm to build the filter), you will need a //training corpus//, i.e. a corpus containing the //!truth.txt// file. (Because otherwise you would not know which emails are spam...)
+  - If you create your own spam filter (or if you want to try to use a machine learning algorithm to build the filter), you will need a //training corpus//, i.e. a corpus containing the ''!truth.txt'' file. (Because otherwise you would not know which emails are spam...)
-  - If we want to evaluate filter quality, we will need both files - //!truth.txt// and //!prediction.txt//. By comparing these files, we can tell how good predictions the filter gives.
+  - And, when we evaluate the quality of a filter, we will need both files - ''!truth.txt'' and ''!prediction.txt''. By comparing these files, we can tell how good the filter's predictions are.
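
The corpus convention described above could be handled roughly as in the following Python sketch. It is only an illustration under stated assumptions, not the course's reference solution: the text above does not fix the exact line layout of ''!truth.txt'' and ''!prediction.txt'', so the sketch assumes that each line holds a file name and a label separated by whitespace, and that file names contain no spaces. The function names ''email_files'', ''read_classification'' and ''agreement'' are hypothetical, and the share of matching labels is just one simple way to compare the two files.

```python
import os

# The two meta-data files named by the corpus convention.
SPECIAL_FILES = {"!truth.txt", "!prediction.txt"}


def email_files(corpus_dir):
    """Return the sorted names of the email files in a corpus folder."""
    return sorted(
        name
        for name in os.listdir(corpus_dir)
        if name not in SPECIAL_FILES
        and os.path.isfile(os.path.join(corpus_dir, name))
    )


def read_classification(path):
    """Read a !truth.txt-style file into a dict {file name: label}.

    Assumption: each line is '<file name> <label>' separated by whitespace.
    """
    labels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            name, label = parts
            labels[name] = label
    return labels


def agreement(truth, prediction):
    """Return the fraction of emails whose predicted label equals the true one."""
    if not truth:
        return 0.0
    matches = sum(
        1 for name, label in truth.items() if prediction.get(name) == label
    )
    return matches / len(truth)
```

A typical use would be to read both files from the same corpus folder with ''read_classification'' and pass the two dictionaries to ''agreement''. An email that is missing from the prediction file counts as a wrong prediction in this sketch.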