Warning
This page is archived. Go to the latest version of the course pages.

Differences

This shows you the differences between two versions of the page.

courses:be5b33prg:homeworks:spam:step3 [2015/11/25 15:58]
xposik [Preparation]
courses:be5b33prg:homeworks:spam:step3 [2015/12/14 14:14]
xposik [Function ''compute_quality_for_corpus()'']
Line 1: Line 1:
 ====== Spam filter - step 3 ======
-Create a set of classes and functions needed to evaluate the filter quality.
+Create additional functions needed to evaluate the filter quality.
  
-/** 
-<WRAP round download>​ 
-[[.unit_testing|Tests]] for step 3: {{:​courses:​a4b99rph:​cviceni:​spam:​test3_quality.zip|}} 
-</​WRAP>​ 
-**/ 
  
  
  
  
-=====Confusion Matrix===== 
- 
-Task: 
-  * In module ''confmat.py'', create class ''BinaryConfusionMatrix''.
-  * The class shall encapsulate a four-tuple of statistics (TP, TN, FP, FN) needed to evaluate a filter.
-  * During initialization, the class takes parameters ''pos_tag'' and ''neg_tag'', i.e. the values that shall be considered positive and negative, respectively. (The class is then generally usable, not only for the spam filter with values ''SPAM'' and ''OK''.)
-  * After the instance is created, all four statistics shall be set to 0.
-  * The class shall have method ''as_dict()'' which returns the confusion matrix as a dictionary with keys ''tp, tn, fp, fn''.
-  * The class shall have method ''update(truth, prediction)'' which increases the relevant counter (TP, TN, FP, FN) by 1 based on the comparison of the ''truth'' and ''prediction'' values with ''pos_tag'' and ''neg_tag''. It raises a ''ValueError'' if the value of ''truth'' or ''prediction'' differs from both ''pos_tag'' and ''neg_tag''.
-  * The class shall have method ''compute_from_dicts(truth_dict, pred_dict)'' which computes the statistics TP, FP, TN, FN from two dictionaries: the first contains the correct classification of emails, the second contains the predictions of the filter.
- 
-Why do we need it? 
-  * Class ''​BinaryConfusionMatrix''​ represents the basis for evaluation of the filter success. 
-  * The class can be used in the following way:<code python>
-    >>> cm1 = BinaryConfusionMatrix(pos_tag='True', neg_tag='False')
-    >>> cm1.as_dict()
-    {'tp': 0, 'tn': 0, 'fp': 0, 'fn': 0}
-    >>> cm1.update('True', 'True')
-    >>> cm1.as_dict()
-    {'tp': 1, 'tn': 0, 'fp': 0, 'fn': 0}
-</code>or<code python>
-    >>> truth_dict = {'em1': 'SPAM', 'em2': 'SPAM', 'em3': 'OK', 'em4': 'OK'}
-    >>> pred_dict = {'em1': 'SPAM', 'em2': 'OK', 'em3': 'OK', 'em4': 'SPAM'}
-    >>> cm2 = BinaryConfusionMatrix(pos_tag='SPAM', neg_tag='OK')
-    >>> cm2.compute_from_dicts(truth_dict, pred_dict)
-    >>> cm2.as_dict()
-    {'tp': 1, 'tn': 1, 'fp': 1, 'fn': 1}
-</code>
- 
-The class shall have at least 3 public methods: ''​as_dict()'',​ ''​update()''​ and ''​compute_from_dicts()''​. 
-^ as_dict() ^ Returns conf. matrix in the form of dictionary. ^ 
-^ Input: | Nothing. | 
-^ Output: | A dictionary with keys ''​tp,​ tn, fp, fn''​ and their values. | 
-^ Effects: | None. | 
- 
-^ update(truth,​ pred) ^ Increase the value of one of the counters according to the values of ''​truth''​ and ''​pred''​. ^ 
-^ Input: | The true and predicted class. | 
-^ Output: | None. | 
-^ Effects: | An increase of a single counter value TP, TN, FP, FN, or raise a ''​ValueError''​. | 
- 
-^ compute_from_dicts(truth_dict,​ pred_dict) ^ Compute the whole confusion matrix from true classes and predictions. ^ 
-^ Input: | Two dictionaries containing the true and predicted classes for individual emails. | 
-^ Output: | None. | 
-^ Effects: | The items of conf. matrix will be set to the numbers of observed TP, TN, FP, FN. | 
- 
-**Note**: You can expect that the dictionaries will have the same set of keys. Think about the situation when the keys would be different: what shall the method do? 
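For reference, the removed ''BinaryConfusionMatrix'' specification above could be met by a class along the following lines. This is only a minimal sketch, not the course's reference solution. Two behaviours are assumptions that the text does not state: ''compute_from_dicts()'' resets the counters before counting, and a key missing from ''pred_dict'' simply raises ''KeyError''.

```python
class BinaryConfusionMatrix:
    """Counts TP, TN, FP, FN for a two-class problem (illustrative sketch)."""

    def __init__(self, pos_tag, neg_tag):
        self.pos_tag = pos_tag
        self.neg_tag = neg_tag
        # All four statistics start at zero.
        self.tp = self.tn = self.fp = self.fn = 0

    def as_dict(self):
        """Return the confusion matrix as a dictionary."""
        return {'tp': self.tp, 'tn': self.tn, 'fp': self.fp, 'fn': self.fn}

    def update(self, truth, prediction):
        """Increase exactly one counter, or raise ValueError on an unknown tag."""
        valid = (self.pos_tag, self.neg_tag)
        if truth not in valid or prediction not in valid:
            raise ValueError('truth and prediction must be one of %r' % (valid,))
        if prediction == self.pos_tag:
            if truth == self.pos_tag:
                self.tp += 1
            else:
                self.fp += 1
        else:
            if truth == self.neg_tag:
                self.tn += 1
            else:
                self.fn += 1

    def compute_from_dicts(self, truth_dict, pred_dict):
        """Recompute all counters from two dictionaries with the same keys."""
        # Assumption: start from zero so the result reflects only these dicts.
        self.tp = self.tn = self.fp = self.fn = 0
        for key, truth in truth_dict.items():
            # Assumption: a key missing from pred_dict raises KeyError here.
            self.update(truth, pred_dict[key])
```

With the two dictionaries from the example above, this sketch yields one count in each of the four cells.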
- 
->​{{page>​courses:​a4b99rph:​internal:​cviceni:​spam:​tyden08#​binaryconfusionmatrix&​editbtn}} 
  
  
Line 64: Line 11:
 Task:
   * Create function ''quality_score(tp, tn, fp, fn)'' in module ''quality.py''.
-  * Function computes the quality score defined during the lab.
+  * Function computes the quality score defined during the lab (find it also [[courses:be5b33prg:homeworks:spam:evaluation#filter_quality_assessment|here]]).
  
 ^ ''quality_score(tp, tn, fp, fn)'' Compute the quality score based on the confusion matrix. ^^
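The page does not give the formula itself; it is defined in the lab and on the linked evaluation page. The sketch below therefore assumes a score of the form (TP + TN) / (TP + TN + 10·FP + FN), in which a false positive (a legitimate email marked as spam) costs ten times more than a false negative. The weight ''FP_WEIGHT'' is part of that assumption and should be checked against the evaluation page before use.

```python
# Assumption: false positives are penalised ten times more than false negatives.
FP_WEIGHT = 10


def quality_score(tp, tn, fp, fn):
    """Return a filter quality score in [0, 1] from confusion-matrix counts.

    Assumed formula: (tp + tn) / (tp + tn + FP_WEIGHT * fp + fn).
    """
    return (tp + tn) / (tp + tn + FP_WEIGHT * fp + fn)
```

Under this assumed formula a perfect filter scores 1.0. If all four counts are zero, the division raises ''ZeroDivisionError''; how to treat an empty corpus is left open here.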
Line 76: Line 23:
   * In module ''quality.py'', create function ''compute_quality_for_corpus(corpus_dir)'' which evaluates the filter quality based on the information contained in files ''!truth.txt'' and ''!prediction.txt'' in the given corpus.
   * The true and predicted classification can be read in the form of dictionaries using function ''read_classification_from_file()''.
-  * The confusion matrix for the given corpus can be computed from the dictionaries using method ''compute_from_dicts()'' of ''BinaryConfusionMatrix'' class.
+  * The confusion matrix for the given corpus can be computed from the dictionaries using the ''compute_confusion_matrix()'' function from step 2.
   * The quality score can be computed from the confusion matrix using function ''quality_score()''.
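The steps listed above can be sketched as follows. The helper functions here are stand-ins so that the example runs on its own; the real ones come from the earlier steps and may differ. The assumptions are: each line of ''!truth.txt'' and ''!prediction.txt'' has the form ''name TAG'', ''compute_confusion_matrix()'' takes the two dictionaries plus the tags and returns a dictionary of counts, and the quality score uses the formula with a tenfold false-positive penalty.

```python
import os


def read_classification_from_file(path):
    """Stand-in helper. Assumes one 'name TAG' pair per line."""
    result = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                result[parts[0]] = parts[1]
    return result


def compute_confusion_matrix(truth_dict, pred_dict, pos_tag='SPAM', neg_tag='OK'):
    """Stand-in helper with a hypothetical signature; performs no tag validation."""
    counts = {'tp': 0, 'tn': 0, 'fp': 0, 'fn': 0}
    for name, truth in truth_dict.items():
        pred = pred_dict[name]
        if pred == pos_tag:
            counts['tp' if truth == pos_tag else 'fp'] += 1
        else:
            counts['tn' if truth == neg_tag else 'fn'] += 1
    return counts


def quality_score(tp, tn, fp, fn):
    """Assumed formula; check it against the evaluation page."""
    return (tp + tn) / (tp + tn + 10 * fp + fn)


def compute_quality_for_corpus(corpus_dir):
    """Read both classification files, build the confusion matrix, score it."""
    truth = read_classification_from_file(os.path.join(corpus_dir, '!truth.txt'))
    pred = read_classification_from_file(os.path.join(corpus_dir, '!prediction.txt'))
    cm = compute_confusion_matrix(truth, pred)
    return quality_score(**cm)
```

Keeping ''compute_quality_for_corpus()'' as a thin composition of the three helpers means each helper can be tested separately.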
  
courses/be5b33prg/homeworks/spam/step3.txt · Last modified: 2015/12/14 14:24 by xposik