
Spam filter - step 3

Create a set of classes and functions needed to evaluate the filter quality.

Tests for step 3: test3_quality.zip

Preparation

  • In the article Binary Classification, find and understand the meaning of abbreviations TP, FP, TN, FN.
  • Take a piece of paper and write down:
    • what these abbreviations mean for the spam filtering problem, and
    • what we need to know to be able to compute them.

Confusion Matrix

Task:

  • In module confmat.py, create class BinaryConfusionMatrix.
  • The class shall encapsulate the four-tuple of statistics TP, TN, FP, FN needed to evaluate a filter.
  • During initialization, the class shall take the parameters pos_tag and neg_tag, i.e. the values that shall be considered positive and negative, respectively. (The class will then be generally usable, not only for the spam filter with the values SPAM and OK.)
  • After the instance creation, all four statistics shall be set to 0.
  • The class shall have a method as_dict() which returns the confusion matrix as a dictionary with the items tp, tn, fp, fn.
  • The class shall have a method update(truth, prediction) which increases the relevant counter (TP, TN, FP, FN) by 1, based on the comparison of the truth and prediction values with pos_tag and neg_tag. It raises a ValueError if the value of truth or prediction differs from both pos_tag and neg_tag.
  • The class shall have a method compute_from_dicts(truth_dict, pred_dict) which computes the statistics TP, FP, TN, FN from two dictionaries: the first one shall contain the correct classification of the emails, the second one shall contain the predictions of the filter. (A sketch of one possible implementation is given at the end of this section.)

Why do we need it?

  • The class BinaryConfusionMatrix is the basis for evaluating how successful a filter is.
  • The class can be used in the following way:
        >>> cm1 = BinaryConfusionMatrix(pos_tag='True', neg_tag='False')
        >>> cm1.as_dict()
        {'tp': 0, 'tn': 0, 'fp': 0, 'fn': 0}
        >>> cm1.update('True', 'True')
        >>> cm1.as_dict()
        {'tp': 1, 'tn': 0, 'fp': 0, 'fn': 0}
        >>> truth_dict = {'em1': 'SPAM', 'em2': 'SPAM', 'em3': 'OK', 'em4':'OK'}
        >>> pred_dict = {'em1': 'SPAM', 'em2': 'OK', 'em3': 'OK', 'em4':'SPAM'}
        >>> cm2 = BinaryConfusionMatrix(pos_tag='SPAM', neg_tag='OK')
        >>> cm2.compute_from_dicts(truth_dict, pred_dict)
        >>> cm2.as_dict()
        {'tp': 1, 'tn': 1, 'fp': 1, 'fn': 1}

The class shall have at least 3 public methods: as_dict(), update() and compute_from_dicts().

as_dict()
  • Description: Returns the confusion matrix in the form of a dictionary.
  • Input: Nothing.
  • Output: A dictionary with the keys tp, tn, fp, fn and their values.
  • Effects: None.

update(truth, pred)
  • Description: Increases the value of one of the counters according to the values of truth and pred.
  • Input: The true and the predicted class.
  • Output: None.
  • Effects: Increases a single counter value (TP, TN, FP, or FN), or raises a ValueError.

compute_from_dicts(truth_dict, pred_dict)
  • Description: Computes the whole confusion matrix from the true classes and the predictions.
  • Input: Two dictionaries containing the true and the predicted classes of the individual emails.
  • Output: None.
  • Effects: The items of the confusion matrix are set to the numbers of observed TP, TN, FP, FN.

Note: You can expect that the dictionaries will have the same set of keys. Think about what the method should do if the keys were different.
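
A minimal sketch of a class satisfying the requirements above might look as follows. Only the public interface is prescribed by the assignment; the internal counter attributes and the error messages are an implementation choice:

    class BinaryConfusionMatrix:
        """Confusion matrix for a binary classification task."""

        def __init__(self, pos_tag, neg_tag):
            self.pos_tag = pos_tag
            self.neg_tag = neg_tag
            self.tp = self.tn = self.fp = self.fn = 0

        def as_dict(self):
            """Return the four counters as a dictionary."""
            return {'tp': self.tp, 'tn': self.tn, 'fp': self.fp, 'fn': self.fn}

        def update(self, truth, prediction):
            """Increase the counter corresponding to (truth, prediction) by 1."""
            for value in (truth, prediction):
                if value not in (self.pos_tag, self.neg_tag):
                    raise ValueError('Unexpected class value: %r' % (value,))
            if truth == self.pos_tag:
                if prediction == self.pos_tag:
                    self.tp += 1    # positive correctly predicted as positive
                else:
                    self.fn += 1    # positive predicted as negative
            else:
                if prediction == self.pos_tag:
                    self.fp += 1    # negative predicted as positive
                else:
                    self.tn += 1    # negative correctly predicted as negative

        def compute_from_dicts(self, truth_dict, pred_dict):
            """Update the counters from two {email name: class} dictionaries."""
            # Assumes both dictionaries have the same set of keys (see the note above).
            for name, truth in truth_dict.items():
                self.update(truth, pred_dict[name])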

Function quality_score()

Task:

  • Create function quality_score(tp, tn, fp, fn) in module quality.py.
  • The function computes the quality score defined during the lab.
quality_score(tp, tn, fp, fn)
  • Description: Computes the quality score based on the confusion matrix.
  • Input: The 4-tuple of values TP, TN, FP, FN.
  • Output: A number between 0 and 1 expressing the quality of the prediction.
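
The exact scoring formula is the one defined during the lab and is not repeated here. Purely as an illustration of the function's shape, the sketch below uses plain accuracy, (TP + TN) / (TP + TN + FP + FN); substitute the formula from the lab:

    def quality_score(tp, tn, fp, fn):
        """Return the filter quality as a number between 0 and 1.

        Illustration only: plain accuracy is used here as a stand-in for
        the scoring formula defined during the lab.
        """
        total = tp + tn + fp + fn
        if total == 0:
            return 0.0          # no emails were evaluated (edge-case choice, an assumption)
        return (tp + tn) / float(total)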

Function compute_quality_for_corpus()

Task:

  • In module quality.py, create function compute_quality_for_corpus(corpus_dir) which evaluates the filter quality based on the information contained in the files !truth.txt and !prediction.txt in the given corpus.
  • The true and the predicted classifications can be read in the form of dictionaries using the function read_classification_from_file().
  • The confusion matrix for the given corpus can be computed from these dictionaries using the method compute_from_dicts() of the BinaryConfusionMatrix class.
  • The quality score can then be computed from the confusion matrix using the function quality_score(). (A sketch combining these steps is given at the end of this section.)

Why do we need it?

  • To compute the quality of the individual filters and to rank them.

compute_quality_for_corpus(corpus_dir)
  • Description: Computes the quality of the predictions for the given corpus.
  • Input: A corpus directory evaluated by a filter (i.e. a directory containing the !truth.txt and !prediction.txt files).
  • Output: The quality of the filter as a number between 0 and 1.
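
A possible sketch of the whole function is shown below. It assumes that read_classification_from_file() (written in an earlier step) takes the path to a classification file and returns a dictionary mapping email file names to class tags, that it lives in a module named utils (adjust the import to your own project layout), and that the class tags are the strings SPAM and OK; quality_score() is defined earlier in the same quality.py module:

    import os

    from confmat import BinaryConfusionMatrix
    from utils import read_classification_from_file   # module name is an assumption

    def compute_quality_for_corpus(corpus_dir):
        """Evaluate the quality of the predictions stored in corpus_dir."""
        # Read the true and the predicted classifications of the emails.
        truth_dict = read_classification_from_file(
            os.path.join(corpus_dir, '!truth.txt'))
        pred_dict = read_classification_from_file(
            os.path.join(corpus_dir, '!prediction.txt'))
        # Fill the confusion matrix and turn it into a single quality number.
        cm = BinaryConfusionMatrix(pos_tag='SPAM', neg_tag='OK')
        cm.compute_from_dicts(truth_dict, pred_dict)
        stats = cm.as_dict()
        return quality_score(stats['tp'], stats['tn'],
                             stats['fp'], stats['fn'])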