Spam filter - step 2

Create function compute_confusion_matrix() that will compute and return a confusion matrix based on real classes of emails, and on email classes predicted by a filter.

Preparation

In the article Binary Classification, find and understand the meaning of abbreviations TP, FP, TN, FN.
Take a piece of paper and write down:
- what these abbreviations mean for the spam filtering problem, and
- what we need to know to be able to compute them.
See the documentation for namedtuple.
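
As a quick orientation before reading the documentation, the following minimal sketch shows how a namedtuple is defined and accessed. The type `Point` and its fields are hypothetical names chosen only for illustration; they are not part of the assignment.

```python
from collections import namedtuple

# Point is a hypothetical example type, unrelated to the assignment.
Point = namedtuple('Point', 'x y')

p = Point(x=1, y=2)

# Fields can be read by name or by position.
print(p.x, p[1])

# A namedtuple is still a tuple: it unpacks and compares like one.
x, y = p
print(p == (1, 2))
```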

Specifications

Task:

In module quality.py, create function compute_confusion_matrix().
The function will have 4 input arguments:
- truth_dict, a dictionary with the true class of individual emails,
- pred_dict, a dictionary with the class predicted for individual emails by a filter,
- pos_tag (optional, with default value True), a class that will be considered positive, and
- neg_tag (optional, with default value False), a class that will be considered negative. Thanks to these optional parameters, the function will be generally usable, not only for the spam filter task (where pos_tag="SPAM" and neg_tag="OK").
The function will compute four statistics, TP, TN, FP, FN, needed to evaluate a filter, and will return them as a namedtuple with the following definition:
```
from collections import namedtuple
 
ConfMat = namedtuple('ConfMat', 'tp tn fp fn')
```

Why do we need it?

Function compute_confusion_matrix() represents the basis for evaluation of the filter performance.
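
To illustrate what "basis for evaluation" can mean, here is a minimal sketch that derives one common quality measure, accuracy, from a ConfMat. The helper name `accuracy` and the choice of this particular measure are assumptions made for illustration; the course may use a different measure in later steps.

```python
from collections import namedtuple

ConfMat = namedtuple('ConfMat', 'tp tn fp fn')


def accuracy(cm):
    """Return the fraction of correctly classified emails (hypothetical helper)."""
    total = cm.tp + cm.tn + cm.fp + cm.fn
    if total == 0:
        # Assumption: report 0.0 for an empty confusion matrix
        # instead of dividing by zero.
        return 0.0
    return (cm.tp + cm.tn) / total


print(accuracy(ConfMat(tp=1, tn=1, fp=1, fn=1)))
```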

Function compute_confusion_matrix() can be used in the following way:

>>> cm1 = compute_confusion_matrix({}, {})
>>> print(cm1)
ConfMat(tp=0, tn=0, fp=0, fn=0)

or

>>> truth_dict = {'em1': 'SPAM', 'em2': 'SPAM', 'em3': 'OK', 'em4':'OK'}
>>> pred_dict = {'em1': 'SPAM', 'em2': 'OK', 'em3': 'OK', 'em4':'SPAM'}
>>> cm2 = compute_confusion_matrix(truth_dict, pred_dict, pos_tag='SPAM', neg_tag='OK')
>>> print(cm2)
ConfMat(tp=1, tn=1, fp=1, fn=1)
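
One possible way to obtain the results above is sketched below. It is not the reference solution: it assumes that both dictionaries have the same keys, and it silently ignores any pair whose labels are neither pos_tag nor neg_tag, which is a design choice the specification does not prescribe.

```python
from collections import namedtuple

ConfMat = namedtuple('ConfMat', 'tp tn fp fn')


def compute_confusion_matrix(truth_dict, pred_dict, pos_tag=True, neg_tag=False):
    """Count TP, TN, FP, FN for predictions compared against the truth."""
    tp = tn = fp = fn = 0
    for key, truth in truth_dict.items():
        # Assumption: every key of truth_dict is also present in pred_dict.
        pred = pred_dict[key]
        if truth == pos_tag and pred == pos_tag:
            tp += 1   # positive email correctly predicted as positive
        elif truth == neg_tag and pred == neg_tag:
            tn += 1   # negative email correctly predicted as negative
        elif truth == neg_tag and pred == pos_tag:
            fp += 1   # negative email wrongly predicted as positive
        elif truth == pos_tag and pred == neg_tag:
            fn += 1   # positive email wrongly predicted as negative
        # Pairs with any other label are ignored in this sketch.
    return ConfMat(tp=tp, tn=tn, fp=fp, fn=fn)
```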

Note: You can expect that the dictionaries will have the same set of keys. Think about the situation when the keys differ: what should the function do?
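
The note leaves the answer open. As one possible policy, and only as an assumption for illustration, the function could refuse to continue when the key sets differ. The helper name `check_same_keys` below is hypothetical; other reasonable policies include evaluating only the keys common to both dictionaries.

```python
def check_same_keys(truth_dict, pred_dict):
    """Raise ValueError if the two dictionaries do not share the same keys."""
    # Dictionary key views support set operations.
    missing = truth_dict.keys() - pred_dict.keys()
    extra = pred_dict.keys() - truth_dict.keys()
    if missing or extra:
        raise ValueError(
            f"key mismatch: no prediction for {sorted(missing)}, "
            f"no true class for {sorted(extra)}"
        )


# Identical key sets pass silently.
check_same_keys({'em1': 'SPAM'}, {'em1': 'OK'})
```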
