Differences

This shows you the differences between two versions of the page.

courses:be5b33prg:homeworks:spam:data [2015/11/24 16:04]
xposik
courses:be5b33prg:homeworks:spam:data [2015/11/24 16:12] (current)
xposik
Line 1:
 ======Data format======
-During this assignment you will work with data sets of emails, which will also contain meta-data about the emails. Such a set of data is usually called [[wp>Text_corpus|corpus]]. In our case, the meta-data for our emails may contain the information whether it is a spam or not and/or what the decision of the spam filter is.
+During this assignment you will work with data sets of emails, which will also contain meta-data about the emails. Such a set of data is usually called [[wp>Text_corpus|corpus]]. In our case, the meta-data will contain information whether a particular email is a spam or not, and/or whether a spam filter thinks that the email is spam or not.
 
 You are given two sets of data to work with, they both come from the same source.
Line 7:
 </WRAP>
 
-So, our email corpus will be:
-  * a folder, where every file is considered an email with the exception of
-  * the //!truth.txt// file, which contains a name of a file with an email and an information about its true nature (spam or not), one file per line, and
-  * the //!prediction.txt// file, which has the same structure as //!truth.txt// and contains the spam filter prediction for the respective email message file.
+We shall use the following **convention**: our email corpus will be
+  * a folder, where every file contains a single email message in a text form, with the exception of
+  * the ''!truth.txt'' file, which contains a name of a file with an email and an information about its true nature (spam or not), one file per line, and
+  * the ''!prediction.txt'' file, which has the same structure as ''!truth.txt'', but contains the spam filter predictions/decisions for the respective email message file.
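The corpus convention described in this hunk can be sketched in Python as follows. This is a minimal sketch, not the course's reference solution. The page does not specify the exact line format of ''!truth.txt'', so the code assumes each line is a file name followed by whitespace and a single label token; the labels ''OK'' and ''SPAM'' used in the usage note and the function names are hypothetical.

```python
import os

# Assumption: these two names are the only non-email files in a corpus folder.
SPECIAL_FILES = {"!truth.txt", "!prediction.txt"}


def read_classification_from_file(path):
    """Return a dict mapping email file name -> label.

    Assumed line format (not confirmed by the page): "<file name> <label>".
    The label is taken to be the last whitespace-separated token on the line.
    """
    labels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rsplit(None, 1)
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            name, label = parts
            labels[name] = label
    return labels


def email_files(corpus_dir):
    """Yield the names of the email files in corpus_dir, in sorted order."""
    for name in sorted(os.listdir(corpus_dir)):
        full_path = os.path.join(corpus_dir, name)
        if name not in SPECIAL_FILES and os.path.isfile(full_path):
            yield name
```

For a folder holding ''email01'', ''email02'' and a ''!truth.txt'' with the lines ''email01 OK'' and ''email02 SPAM'', ''email_files'' would yield the two email names and ''read_classification_from_file'' would return a dictionary with one label per email.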
  
 Of course, these two files do not have to be present in the corpus directory:
-  - Spam filter itself does not need either of them to work and decide. However, it should be able to create //!prediction.txt// containing its predictions.
-  - If you create your own spam filter (or if you want to try to use a machine learning algorithm to build the filter), you will need a //training corpus//, i.e. a corpus containing the //!truth.txt// file. (Because otherwise you would not know which emails are spam...)
-  - If we want to evaluate filter quality, we will need both files - //!truth.txt// and //!prediction.txt//. By comparing these files, we can tell how good predictions the filter gives.
+  - Spam filter itself needs neither of them to work and decide. However, the spam filter should be able to create ''!prediction.txt'' with the correct structure containing its predictions.
+  - If you create your own spam filter (or if you want to try to use a machine learning algorithm to build the filter), you will need a //training corpus//, i.e. a corpus containing the ''!truth.txt'' file. (Because otherwise you would not know which emails are spam...)
+  - And, when we evaluate quality of a filter, we will need both files - ''!truth.txt'' and ''!prediction.txt''. By comparing these files, we can tell how good predictions the filter provides.
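Comparing the two files, as the last item describes, could look roughly like the sketch below. The page does not say which quality measure is used, so plain accuracy (the fraction of emails whose predicted label equals the true label) is an assumption chosen for illustration. The function name, the dictionary representation and the labels ''OK'' and ''SPAM'' are hypothetical as well.

```python
def accuracy(truth, prediction):
    """Return the fraction of emails whose predicted label matches the truth.

    Both arguments are dicts mapping email file name -> label, e.g. the
    parsed contents of !truth.txt and !prediction.txt. An email that is
    missing from `prediction` counts as a wrong prediction.
    """
    if not truth:
        raise ValueError("truth must contain at least one email")
    correct = sum(
        1 for name, label in truth.items() if prediction.get(name) == label
    )
    return correct / len(truth)


# Hypothetical labels; the page does not fix the label vocabulary.
truth = {"email01": "OK", "email02": "SPAM", "email03": "OK", "email04": "SPAM"}
prediction = {"email01": "OK", "email02": "SPAM", "email03": "SPAM", "email04": "SPAM"}
print(accuracy(truth, prediction))  # 3 of 4 labels agree, prints 0.75
```

Accuracy treats both kinds of error equally. A spam filter evaluation may want to penalise a legitimate email marked as spam more heavily than a missed spam, which would need a different measure than the one sketched here.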
courses/be5b33prg/homeworks/spam/data.1448377471.txt.gz · Last modified: 2015/11/24 16:04 by xposik