Spam filter - step 2

Create a function compute_confusion_matrix() that computes and returns a confusion matrix based on the true classes of emails and on the classes predicted for them by a filter.

Preparation

  • In the article Binary Classification, find and understand the meaning of abbreviations TP, FP, TN, FN.
  • Take a piece of paper and write down:
    • what these abbreviations mean for the spam filtering problem, and
    • what we need to know to be able to compute them.
  • See the documentation for namedtuple.
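
For readers who have not used namedtuple before, the following is a minimal sketch of how it behaves. The Point type and its fields are hypothetical and were chosen only for illustration; they are not part of the assignment.

```python
from collections import namedtuple

# Hypothetical example type, used only to demonstrate namedtuple.
Point = namedtuple('Point', 'x y')

p = Point(x=1, y=2)
print(p)           # Point(x=1, y=2)
print(p.x, p[1])   # fields can be read by name or by index
a, b = p           # a namedtuple unpacks like an ordinary tuple
```

Because a namedtuple is still a tuple, it compares equal to a plain tuple with the same values, which can be convenient when writing tests.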

Specifications

Task:

  • In module quality.py, create function compute_confusion_matrix().
  • The function will have 4 input arguments:
    • truth_dict, a dictionary with the true class of individual emails,
    • pred_dict, a dictionary with the class predicted for individual emails by a filter,
    • pos_tag (optional, with default value True), a class that will be considered positive, and
    • neg_tag (optional, with default value False), a class that will be considered negative. Thanks to these optional parameters, the function is generally usable, not only for the spam filter task (where pos_tag='SPAM' and neg_tag='OK').
  • The function will compute four statistics, TP, TN, FP, FN, needed to evaluate a filter, and will return them as a namedtuple with the following definition:
    from collections import namedtuple
     
    ConfMat = namedtuple('ConfMat', 'tp tn fp fn')
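
The specification above could be implemented along the following lines. This is only a sketch of one possible approach, not the reference solution. It assumes that both dictionaries share the same keys, and it makes the additional assumption (not stated in the task) that emails whose label matches neither pos_tag nor neg_tag are skipped.

```python
from collections import namedtuple

ConfMat = namedtuple('ConfMat', 'tp tn fp fn')


def compute_confusion_matrix(truth_dict, pred_dict, pos_tag=True, neg_tag=False):
    """Count TP, TN, FP and FN over the emails listed in truth_dict."""
    tp = tn = fp = fn = 0
    for key, truth in truth_dict.items():
        pred = pred_dict[key]  # assumes pred_dict contains every key of truth_dict
        if truth == pos_tag and pred == pos_tag:
            tp += 1   # positive email correctly predicted as positive
        elif truth == neg_tag and pred == neg_tag:
            tn += 1   # negative email correctly predicted as negative
        elif truth == neg_tag and pred == pos_tag:
            fp += 1   # negative email wrongly predicted as positive
        elif truth == pos_tag and pred == neg_tag:
            fn += 1   # positive email wrongly predicted as negative
        # Assumption: labels equal to neither tag are silently ignored.
    return ConfMat(tp=tp, tn=tn, fp=fp, fn=fn)
```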

Why do we need it?

  • Function compute_confusion_matrix() is the basis for evaluating the performance of a filter.
  • The function can be used in the following way:
    >>> cm1 = compute_confusion_matrix({}, {})
    >>> print(cm1)
    ConfMat(tp=0, tn=0, fp=0, fn=0)
    or
    >>> truth_dict = {'em1': 'SPAM', 'em2': 'SPAM', 'em3': 'OK', 'em4':'OK'}
    >>> pred_dict = {'em1': 'SPAM', 'em2': 'OK', 'em3': 'OK', 'em4':'SPAM'}
    >>> cm2 = compute_confusion_matrix(truth_dict, pred_dict, pos_tag='SPAM', neg_tag='OK')
    >>> print(cm2)
    ConfMat(tp=1, tn=1, fp=1, fn=1)

Note: You can assume that both dictionaries have the same set of keys. Still, think about the situation in which the keys differ: what should the function do?
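
One possible answer to the question above is to refuse to compute anything when the key sets differ. The sketch below illustrates that choice with a hypothetical helper, check_same_keys(), which is not part of the assignment. Other reasonable designs exist, for example counting only the emails present in both dictionaries.

```python
def check_same_keys(truth_dict, pred_dict):
    """Raise ValueError if the dictionaries do not cover the same emails."""
    # Symmetric difference of the key views: keys found in exactly one dict.
    mismatched = truth_dict.keys() ^ pred_dict.keys()
    if mismatched:
        raise ValueError(
            'keys present in only one dictionary: {}'.format(sorted(mismatched))
        )
```

Such a check could be called at the start of compute_confusion_matrix(), so that a mismatch is reported explicitly instead of surfacing later as a KeyError or as silently wrong counts.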

courses/be5b33prg/homeworks/spam/step2.1448465592.txt.gz · Last modified: 2015/11/25 16:33 by xposik