Create a set of classes and functions needed to evaluate the filter quality.
Tests for step 3: test3_quality.zip
Task: In module confmat.py, create class BinaryConfusionMatrix.

The class shall be initialized with two parameters, pos_tag and neg_tag, i.e. the values that shall be considered positive and negative, respectively. (The class will then be generally usable, not only for the spam filter with the values SPAM and OK.)
The method as_dict() returns the confusion matrix as a dictionary with the items tp, tn, fp, fn.
The method update(truth, prediction) increases the value of the relevant counter (TP, TN, FP, FN) by 1, based on the comparison of the truth and prediction values with pos_tag and neg_tag. It raises a ValueError if the value of truth or prediction is different from both pos_tag and neg_tag.
The method compute_from_dicts(truth_dict, pred_dict) computes the statistics TP, FP, TN, FN from two dictionaries: the first one shall contain the correct classification of the emails, the second one shall contain the predictions of the filter.
Why do we need it?
BinaryConfusionMatrix represents the basis for the evaluation of the filter's success.
>>> cm1 = BinaryConfusionMatrix(pos_tag='True', neg_tag='False')
>>> cm1.as_dict()
{'tp': 0, 'tn': 0, 'fp': 0, 'fn': 0}
>>> cm1.update('True', 'True')
>>> cm1.as_dict()
{'tp': 1, 'tn': 0, 'fp': 0, 'fn': 0}
>>> truth_dict = {'em1': 'SPAM', 'em2': 'SPAM', 'em3': 'OK', 'em4': 'OK'}
>>> pred_dict = {'em1': 'SPAM', 'em2': 'OK', 'em3': 'OK', 'em4': 'SPAM'}
>>> cm2 = BinaryConfusionMatrix(pos_tag='SPAM', neg_tag='OK')
>>> cm2.compute_from_dicts(truth_dict, pred_dict)
>>> cm2.as_dict()
{'tp': 1, 'tn': 1, 'fp': 1, 'fn': 1}
The class shall have at least 3 public methods: as_dict(), update(), and compute_from_dicts().
| as_dict() | Returns conf. matrix in the form of a dictionary. |
|---|---|
| Input: | Nothing. |
| Output: | A dictionary with keys tp, tn, fp, fn and their values. |
| Effects: | None. |
| update(truth, pred) | Increase the value of one of the counters according to the values of truth and pred. |
|---|---|
| Input: | The true and predicted class. |
| Output: | None. |
| Effects: | An increase of a single counter value TP, TN, FP, FN, or a raised ValueError. |
| compute_from_dicts(truth_dict, pred_dict) | Compute the whole confusion matrix from true classes and predictions. |
|---|---|
| Input: | Two dictionaries containing the true and predicted classes for individual emails. |
| Output: | None. |
| Effects: | The items of the conf. matrix will be set to the numbers of observed TP, TN, FP, FN. |
Note: You can expect that the dictionaries will have the same set of keys. Still, think about the situation where the keys differ: what should the method do then?
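One possible implementation of the class described above might look as follows; this is a sketch, not the reference solution, and the internal representation (four integer attributes) is just one option:

```python
class BinaryConfusionMatrix:
    """Confusion matrix for a binary classification task."""

    def __init__(self, pos_tag, neg_tag):
        self.pos_tag = pos_tag
        self.neg_tag = neg_tag
        self.tp = self.tn = self.fp = self.fn = 0

    def as_dict(self):
        # Return the four counters as a dictionary.
        return {'tp': self.tp, 'tn': self.tn,
                'fp': self.fp, 'fn': self.fn}

    def update(self, truth, prediction):
        # Reject any value other than pos_tag or neg_tag.
        for value in (truth, prediction):
            if value not in (self.pos_tag, self.neg_tag):
                raise ValueError(f'unexpected class value: {value!r}')
        # Increase exactly one counter.
        if truth == self.pos_tag:
            if prediction == self.pos_tag:
                self.tp += 1
            else:
                self.fn += 1
        else:
            if prediction == self.pos_tag:
                self.fp += 1
            else:
                self.tn += 1

    def compute_from_dicts(self, truth_dict, pred_dict):
        # Assumes both dictionaries have the same set of keys.
        for key, truth in truth_dict.items():
            self.update(truth, pred_dict[key])
```

Note that compute_from_dicts() simply reuses update(), so invalid class values raise a ValueError there as well.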
Task: Create function quality_score(tp, tn, fp, fn) in module quality.py.
| quality_score(tp, tn, fp, fn) | Compute the quality score based on the confusion matrix. |
|---|---|
| Inputs: | A 4-tuple of values TP, TN, FP, FN. |
| Outputs: | A number between 0 and 1 showing the prediction quality measure. |
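The specification above only requires a number between 0 and 1; it does not fix the formula. Plain accuracy is one simple measure that satisfies it (choosing accuracy here is an assumption; a measure that, e.g., penalizes false positives more heavily would also fit):

```python
def quality_score(tp, tn, fp, fn):
    # Plain accuracy: the fraction of correctly classified emails.
    # Guard against an empty confusion matrix (all counters zero).
    total = tp + tn + fp + fn
    if total == 0:
        return 0.0
    return (tp + tn) / total
```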
Task: In module quality.py, create function compute_quality_for_corpus(corpus_dir) which evaluates the filter quality based on the information contained in the files !truth.txt and !prediction.txt in the given corpus. You can build on:

- the function read_classification_from_file(),
- the method compute_from_dicts() of the BinaryConfusionMatrix class,
- the function quality_score().
Why do we need it?
| compute_quality_for_corpus(corpus_dir) | Compute the quality of predictions for a given corpus. |
|---|---|
| Inputs: | A corpus directory evaluated by a filter (i.e. a directory containing the !truth.txt and !prediction.txt files). |
| Outputs: | Quality of the filter as a number between 0 and 1. |
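The building blocks can be composed roughly as sketched below. In your solution you would import read_classification_from_file, BinaryConfusionMatrix, and quality_score from the earlier steps; here, simplified stand-ins are inlined so the example is self-contained, and the file format ("filename class" per line) and the SPAM/OK tags are assumptions:

```python
import os

def read_classification_from_file(filepath):
    # Simplified stand-in for the helper from an earlier step:
    # every non-empty line is assumed to hold "<email filename> <class>".
    with open(filepath, encoding='utf-8') as f:
        return dict(line.split() for line in f if line.strip())

def quality_score(tp, tn, fp, fn):
    # Placeholder measure (plain accuracy); plug in your own quality_score().
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

def compute_quality_for_corpus(corpus_dir):
    # 1. Read the true and the predicted classification of the corpus.
    truth = read_classification_from_file(os.path.join(corpus_dir, '!truth.txt'))
    pred = read_classification_from_file(os.path.join(corpus_dir, '!prediction.txt'))
    # 2. Count TP, TN, FP, FN -- this is what the compute_from_dicts()
    #    method does; SPAM is taken as positive, OK as negative.
    tp = tn = fp = fn = 0
    for name, true_tag in truth.items():
        predicted = pred[name]
        if true_tag == 'SPAM' and predicted == 'SPAM':
            tp += 1
        elif true_tag == 'OK' and predicted == 'OK':
            tn += 1
        elif true_tag == 'OK' and predicted == 'SPAM':
            fp += 1
        else:
            fn += 1
    # 3. Reduce the four counters to a single number in [0, 1].
    return quality_score(tp, tn, fp, fn)
```

With the example truth/prediction dictionaries from step 3 written into a corpus directory, this sketch yields 0.5 (one of each of TP, TN, FP, FN).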