Computer Lab 10, Spam filter II

Homework

Finish the homework on files and submit it to the upload system. Deadline is tonight 23:59!

Work on the spam filter task. Submit your solution to the upload system according to the specifications. The deadline is Dec 6, 2019!

Spam filter - steps 1-3

  1. Create a new PyCharm project. Call it, for example, spam_filter.
  2. Download the training/test data from Data, unzip the archive, and place it in the root folder of the project you created in the previous step. You should now see folders “1” and “2” in your PyCharm project.
  3. Create a new module utils.py in your project.
  4. Create a new function def read_classification_from_file(fpath) inside the utils.py module. fpath is a string containing the path to either a !truth.txt or a !prediction.txt file. See how these files are formatted in Spam filter - step 1. Implement this function so that it outputs a dictionary where keys are email filenames and values are classifications (“SPAM”/“OK”).
    def read_classification_from_file(fpath):
        """Return a dictionary with email classification
     
        :param fpath: string, path to a text file !truth.txt or !prediction.txt
        :return: dictionary, keys are email filenames, values are their classifications
        """
  5. Create a new function def write_classification_to_file(cls_dict, fpath) inside the utils.py module. This function takes a dictionary with email classifications and writes it to a file in a pre-defined format. cls_dict is a dictionary with email filenames and their classifications, with exactly the same structure as the output of the read_classification_from_file() function; fpath is a string with the path of the file that should be created (this will typically be !prediction.txt). This function is essentially the inverse of read_classification_from_file(). For more information see Spam filter - step 1.
  6. Create a new module quality.py
  7. Write a function def compute_confusion_matrix(truth_dict, pred_dict, pos_tag=True, neg_tag=False) inside the quality.py module. This function receives a dictionary truth_dict with the ground truth classification (emails manually labeled spam/ok) and a dictionary pred_dict with the classification “guessed” by the spam filter. A spam filter is usually not 100% correct when predicting which emails are spam and which are not. This function should, therefore, compare the spam filter's estimate with the ground truth and produce four numbers that characterize the result:
  • TP (true positives; the cases for which the classifier predicted ‘spam’ and the emails were actually spam)
  • TN (true negatives; the cases for which the classifier predicted ‘not spam’ and the emails were actually real)
  • FN (false negatives; the cases for which the classifier predicted ‘not spam’ but the emails were actually spam)
  • FP (false positives; the cases for which the classifier predicted ‘spam’ but the emails were actually real)

This function has two extra parameters, pos_tag and neg_tag. They specify how positive and negative cases are encoded in the input dictionaries. Typically pos_tag = “SPAM” and neg_tag = “OK”. The output of this function is a namedtuple containing tp, tn, fn, fp. For more information and a few test cases see Spam filter - step 2.
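The comparison described above can be sketched roughly as follows. This is only an illustrative sketch, not the reference solution: the namedtuple name ConfMat and its field order are assumptions (check the specification for the required order), and the code assumes that every filename in truth_dict also appears in pred_dict.

```python
from collections import namedtuple

# Hypothetical name and field order; the specification decides the real ones.
ConfMat = namedtuple("ConfMat", "tp tn fp fn")


def compute_confusion_matrix(truth_dict, pred_dict, pos_tag=True, neg_tag=False):
    """Count TP, TN, FP and FN by comparing predictions with the ground truth."""
    tp = tn = fp = fn = 0
    for fname, truth in truth_dict.items():
        # Assumption: pred_dict has an entry for every file in truth_dict.
        pred = pred_dict[fname]
        if truth == pos_tag and pred == pos_tag:
            tp += 1  # spam correctly flagged as spam
        elif truth == neg_tag and pred == neg_tag:
            tn += 1  # real email correctly passed
        elif truth == neg_tag and pred == pos_tag:
            fp += 1  # real email wrongly flagged as spam
        elif truth == pos_tag and pred == neg_tag:
            fn += 1  # spam that slipped through
    return ConfMat(tp=tp, tn=tn, fp=fp, fn=fn)


# Made-up example data, one email for each of the four outcomes.
truth = {"em1": "SPAM", "em2": "SPAM", "em3": "OK", "em4": "OK"}
pred = {"em1": "SPAM", "em2": "OK", "em3": "OK", "em4": "SPAM"}
cm = compute_confusion_matrix(truth, pred, pos_tag="SPAM", neg_tag="OK")
print(cm)  # ConfMat(tp=1, tn=1, fp=1, fn=1)
```

Accessing the result by field name (cm.tp, cm.fn, ...) rather than by position keeps the calling code independent of the field order.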

  8. Write a function quality_score(tp, tn, fp, fn) in the quality.py module. It receives four integers (tp, tn, fp, fn, described above) as input and outputs a single number, the prediction quality measure, defined by the following formula: $ q = \frac{TP + TN}{TP + TN + 10 \cdot FP + FN}$. Note: False positives (a real message classified as spam) are multiplied by a factor of 10. That is, 1 FP is worth 10 FN. Keep that in mind when implementing your own spam filter.
  9. Write a function compute_quality_for_corpus(corpus_dir) in the quality.py module. This function receives a path to a directory (corpus_dir) where !truth.txt and !prediction.txt are expected. You should use the functions from the previous steps to read these two files and compute the prediction quality measure. More information here. See the following diagram for the recommended structure:
  10. Check the specifications before uploading your solution to the upload system.
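To see how strongly the factor of 10 penalizes false positives, the quality formula can be tried on two made-up filters that each misclassify 10 out of 100 emails. This is a minimal sketch, not the reference solution; returning 0.0 for an empty corpus is an assumption, since the text above does not say what should happen when the denominator is zero.

```python
def quality_score(tp, tn, fp, fn):
    """Quality measure q = (TP + TN) / (TP + TN + 10 * FP + FN)."""
    denominator = tp + tn + 10 * fp + fn
    if denominator == 0:
        # Assumption: no emails at all gives 0.0 instead of a ZeroDivisionError.
        return 0.0
    return (tp + tn) / denominator


# Filter A makes 10 mistakes, all of them false negatives (spam let through).
q_a = quality_score(tp=40, tn=50, fp=0, fn=10)  # 90 / 100
# Filter B makes 10 mistakes, all of them false positives (real mail blocked).
q_b = quality_score(tp=40, tn=50, fp=10, fn=0)  # 90 / 190, roughly 0.47

print(q_a)  # 0.9
print(q_b)
```

Both filters have the same number of errors, but the one that blocks real messages scores roughly half as well, so a filter that is unsure about a message does better under this measure by labeling it OK.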
courses/be5b33prg/labs/week_10.txt · Last modified: 2019/11/28 13:21 by nemymila