====== Spam filter - step 3 ======

Create a set of classes and functions needed to evaluate the filter quality.

[[.unit_testing|Tests]] for step 3: {{:courses:a4b99rph:cviceni:spam:test3_quality.zip|}}

===== Preparation =====

  * In the article [[http://en.wikipedia.org/wiki/Special:Search/Binary_Classification|Binary Classification]], find and understand the meaning of the abbreviations TP, FP, TN, FN.
  * Take a piece of paper and write down:
    * what these abbreviations mean for the spam filtering problem, and
    * what we need to know to be able to compute them.

===== Confusion Matrix =====

Task:
  * In module ''confmat.py'', create class ''BinaryConfusionMatrix''.
  * The class shall encapsulate the four statistics TP, TN, FP, FN needed to evaluate a filter.
  * During initialization, the class shall take parameters ''pos_tag'' and ''neg_tag'', i.e. the values that shall be considered positive and negative, respectively. (The class is then generally usable, not only for a spam filter with the values ''SPAM'' and ''OK''.)
  * After instance creation, all four statistics shall be set to 0.
  * The class shall have a method ''as_dict()'' which returns the confusion matrix as a dictionary with items ''tp, tn, fp, fn''.
  * The class shall have a method ''update(truth, prediction)'' which increases the relevant counter (TP, TN, FP, FN) by 1, based on the comparison of the ''truth'' and ''prediction'' values with ''pos_tag'' and ''neg_tag''. It raises a ''ValueError'' if the value of ''truth'' or ''prediction'' differs from both ''pos_tag'' and ''neg_tag''.
  * The class shall have a method ''compute_from_dicts(truth_dict, pred_dict)'' which computes the statistics TP, FP, TN, FN from two dictionaries: the first one shall contain the correct classification of the emails, the second one the predictions of the filter.

Why do we need it?
  * Class ''BinaryConfusionMatrix'' is the basis for evaluating the success of the filter.
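The requirements above might be implemented along the following lines. This is only a sketch, not the reference solution; the counting logic assumes that ''truth'' decides between the TP/FN and FP/TN pairs, and ''prediction'' then picks one member of the pair.

<code python>
class BinaryConfusionMatrix:
    """Confusion matrix for a binary classification task."""

    def __init__(self, pos_tag, neg_tag):
        self.pos_tag = pos_tag
        self.neg_tag = neg_tag
        # All four counters start at zero.
        self.tp = self.tn = self.fp = self.fn = 0

    def as_dict(self):
        return {'tp': self.tp, 'tn': self.tn, 'fp': self.fp, 'fn': self.fn}

    def update(self, truth, prediction):
        # Reject any value other than pos_tag or neg_tag.
        for value in (truth, prediction):
            if value not in (self.pos_tag, self.neg_tag):
                raise ValueError('unexpected class value: %r' % (value,))
        if truth == self.pos_tag:
            if prediction == self.pos_tag:
                self.tp += 1    # positive correctly predicted
            else:
                self.fn += 1    # positive missed by the filter
        else:
            if prediction == self.pos_tag:
                self.fp += 1    # negative wrongly flagged as positive
            else:
                self.tn += 1    # negative correctly predicted

    def compute_from_dicts(self, truth_dict, pred_dict):
        # Assumes both dictionaries have the same set of keys.
        for key, truth in truth_dict.items():
            self.update(truth, pred_dict[key])
</code>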
  * The class can be used in the following way:

<code python>
>>> cm1 = BinaryConfusionMatrix(pos_tag='True', neg_tag='False')
>>> cm1.as_dict()
{'tp': 0, 'tn': 0, 'fp': 0, 'fn': 0}
>>> cm1.update('True', 'True')
>>> cm1.as_dict()
{'tp': 1, 'tn': 0, 'fp': 0, 'fn': 0}
>>> truth_dict = {'em1': 'SPAM', 'em2': 'SPAM', 'em3': 'OK', 'em4': 'OK'}
>>> pred_dict = {'em1': 'SPAM', 'em2': 'OK', 'em3': 'OK', 'em4': 'SPAM'}
>>> cm2 = BinaryConfusionMatrix(pos_tag='SPAM', neg_tag='OK')
>>> cm2.compute_from_dicts(truth_dict, pred_dict)
>>> cm2.as_dict()
{'tp': 1, 'tn': 1, 'fp': 1, 'fn': 1}
</code>

The class shall have at least 3 public methods: ''as_dict()'', ''update()'' and ''compute_from_dicts()''.

^ as_dict() ^ Returns the confusion matrix in the form of a dictionary. ^
^ Input: | Nothing. |
^ Output: | A dictionary with keys ''tp, tn, fp, fn'' and their values. |
^ Effects: | None. |

^ update(truth, pred) ^ Increases the value of one of the counters according to the values of ''truth'' and ''pred''. ^
^ Input: | The true and predicted class. |
^ Output: | None. |
^ Effects: | A single counter TP, TN, FP, FN is increased, or a ''ValueError'' is raised. |

^ compute_from_dicts(truth_dict, pred_dict) ^ Computes the whole confusion matrix from the true classes and predictions. ^
^ Input: | Two dictionaries containing the true and predicted classes of the individual emails. |
^ Output: | None. |
^ Effects: | The items of the confusion matrix are set to the numbers of observed TP, TN, FP, FN. |

**Note**: You can expect that the dictionaries will have the same set of keys. Think about the situation when the keys differ: what shall the method do then?

>{{page>courses:a4b99rph:internal:cviceni:spam:tyden08#binaryconfusionmatrix&editbtn}}

===== Function ''quality_score()'' =====

Task:
  * Create function ''quality_score(tp, tn, fp, fn)'' in module ''quality.py''.
  * The function computes the quality score defined during the lab.

^ ''quality_score(tp, tn, fp, fn)'' ^ Compute the quality score based on the confusion matrix. ^
^ Inputs | A 4-tuple of values TP, TN, FP, FN. |
^ Outputs | A number between 0 and 1 expressing the quality of the prediction. |

>{{page>courses:a4b99rph:internal:cviceni:spam:tyden08#quality_score&editbtn}}

===== Function ''compute_quality_for_corpus()'' =====

Task:
  * In module ''quality.py'', create function ''compute_quality_for_corpus(corpus_dir)'' which evaluates the filter quality based on the information contained in the files ''!truth.txt'' and ''!prediction.txt'' in the given corpus.
  * The true and predicted classifications can be read in the form of dictionaries using function ''read_classification_from_file()''.
  * The confusion matrix for the given corpus can be computed from these dictionaries using the method ''compute_from_dicts()'' of the ''BinaryConfusionMatrix'' class.
  * The quality score can then be computed from the confusion matrix using function ''quality_score()''.

Why do we need it?
  * To compute the quality of a filter, and to rank the filters by it.

^ ''compute_quality_for_corpus(corpus_dir)'' ^ Compute the quality of the predictions for the given corpus. ^
^ Inputs | A corpus directory evaluated by a filter (i.e. a directory containing the ''!truth.txt'' and ''!prediction.txt'' files). |
^ Outputs | The quality of the filter as a number between 0 and 1. |

>{{page>courses:a4b99rph:internal:cviceni:spam:tyden08#compute_quality_for_corpus&editbtn}}
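The whole evaluation chain might be sketched as follows. Everything here is an assumption for illustration only: the inlined ''read_classification_from_file()'' assumes one ''<filename> <label>'' pair per line (in your solution, import the function you already wrote), and ''quality_score()'' uses plain accuracy as a placeholder; replace it with the formula defined during the lab.

<code python>
import os


def read_classification_from_file(fpath):
    # Assumed format: one "<filename> <label>" pair per line.
    classification = {}
    with open(fpath, 'rt', encoding='utf-8') as f:
        for line in f:
            name, label = line.split()
            classification[name] = label
    return classification


def quality_score(tp, tn, fp, fn):
    # Placeholder: plain accuracy. Substitute the score defined during the lab.
    return (tp + tn) / (tp + tn + fp + fn)


def compute_quality_for_corpus(corpus_dir):
    truth_dict = read_classification_from_file(
        os.path.join(corpus_dir, '!truth.txt'))
    pred_dict = read_classification_from_file(
        os.path.join(corpus_dir, '!prediction.txt'))
    # Count TP, TN, FP, FN with SPAM as the positive class;
    # with your BinaryConfusionMatrix, use compute_from_dicts() instead.
    tp = tn = fp = fn = 0
    for name, truth in truth_dict.items():
        pred = pred_dict[name]
        if truth == 'SPAM':
            if pred == 'SPAM':
                tp += 1
            else:
                fn += 1
        else:
            if pred == 'SPAM':
                fp += 1
            else:
                tn += 1
    return quality_score(tp, tn, fp, fn)
</code>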