Warning
This page is located in archive.

Differences

This shows you the differences between two versions of the page.

courses:be5b33prg:homeworks:spam:data [2015/11/24 16:06]
xposik
courses:be5b33prg:homeworks:spam:data [2015/11/24 16:12]
xposik
Line 7:
 </WRAP>
 
-So, our email corpus will be:
+We shall use the following **convention**: our email corpus will be
-  * a folder, where every file is considered an email with the exception of
+  * a folder, where every file contains a single email message in text form, with the exception of
-  * the //!truth.txt// file, which contains the name of a file with an email and information about its true nature (spam or not), one file per line, and
+  * the ''!truth.txt'' file, which contains the name of a file with an email and information about its true nature (spam or not), one file per line, and
-  * the //!prediction.txt// file, which has the same structure as //!truth.txt// and contains the spam filter prediction for the respective email message file.
+  * the ''!prediction.txt'' file, which has the same structure as ''!truth.txt'', but contains the spam filter predictions/decisions for the respective email message file.
 
 Of course, these two files do not have to be present in the corpus directory:
-  - The spam filter itself does not need either of them to work and decide. However, it should be able to create //!prediction.txt// containing its predictions.
+  - The spam filter itself needs neither of them to work and decide. However, the spam filter should be able to create ''!prediction.txt'' with the correct structure containing its predictions.
-  - If you create your own spam filter (or if you want to try to use a machine learning algorithm to build the filter), you will need a //training corpus//, i.e. a corpus containing the //!truth.txt// file. (Because otherwise you would not know which emails are spam...)
+  - If you create your own spam filter (or if you want to try to use a machine learning algorithm to build the filter), you will need a //training corpus//, i.e. a corpus containing the ''!truth.txt'' file. (Because otherwise you would not know which emails are spam...)
-  - If we want to evaluate filter quality, we will need both files - //!truth.txt// and //!prediction.txt//. By comparing these files, we can tell how good the predictions are that the filter gives.
+  - And, when we evaluate the quality of a filter, we will need both files - ''!truth.txt'' and ''!prediction.txt''. By comparing these files, we can tell how good the predictions are that the filter provides.
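
The corpus convention described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the course's reference solution: the page does not say how the file name and the label are separated on a line, nor which label strings are used, so the sketch assumes a whitespace separator with the label as the last token, and uses ''SPAM'' and ''OK'' purely as hypothetical labels. All function names are made up for this example.

```python
from pathlib import Path

# Names of the two special files, as given by the convention above.
TRUTH_FILE = "!truth.txt"
PREDICTION_FILE = "!prediction.txt"


def read_classification(path):
    """Read a file with one '<file name> <label>' pair per line into a dict.

    Assumption: the label is the last whitespace-separated token on the line.
    """
    labels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip empty lines
            name, label = line.rsplit(maxsplit=1)
            labels[name] = label
    return labels


def write_classification(path, labels):
    """Write a dict of file name -> label, one pair per line."""
    with open(path, "w", encoding="utf-8") as f:
        for name, label in sorted(labels.items()):
            f.write(f"{name} {label}\n")


def email_files(corpus_dir):
    """Yield the names of all email files, skipping the two special files."""
    for entry in sorted(Path(corpus_dir).iterdir()):
        if entry.is_file() and entry.name not in (TRUTH_FILE, PREDICTION_FILE):
            yield entry.name


def agreement(corpus_dir):
    """Return the fraction of emails in !truth.txt whose prediction matches."""
    truth = read_classification(Path(corpus_dir) / TRUTH_FILE)
    prediction = read_classification(Path(corpus_dir) / PREDICTION_FILE)
    if not truth:
        return 0.0  # nothing to compare
    matches = sum(1 for name, label in truth.items()
                  if prediction.get(name) == label)
    return matches / len(truth)
```

A filter would iterate over ''email_files(corpus_dir)'', decide on a label for each message, and pass the resulting dictionary to ''write_classification''. The simple agreement ratio here is just one possible way of comparing the two files.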
courses/be5b33prg/homeworks/spam/data.txt · Last modified: 2015/11/24 16:12 by xposik