Finish the homework on files and submit it to the upload system. Deadline is tonight 23:59!
Work on the spam filter task. Submit your solution to the upload system according to the specifications. Deadline is Dec 6 2019!
Create the module utils.py in your project.
Create the function read_classification_from_file(fpath) inside the utils.py module. fpath is a string containing the path to either a !truth.txt or a !prediction.txt file. See Spam filter - step 1 for how these files are formatted. Implement this function so that it returns a dictionary whose keys are email filenames and whose values are their classifications ("SPAM"/"OK").

    def read_classification_from_file(fpath):
        """Return a dictionary with email classification

        :param fpath: string, path to a text file !truth.txt or !prediction.txt
        :return: dictionary, keys are email filenames, values are their classifications
        """
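A minimal sketch of one possible implementation follows. The file format is not described in this section, so the sketch assumes one email per line in the form `<filename> <label>`, separated by whitespace. Check Spam filter - step 1 and adjust the parsing if the actual format differs.

```python
def read_classification_from_file(fpath):
    """Return a dictionary mapping email filenames to their classifications.

    ASSUMPTION: each non-empty line looks like "<filename> <label>",
    e.g. "email01 SPAM". This format is not confirmed by this section.
    """
    classification = {}
    with open(fpath, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            # Split on the last run of whitespace: the label is the final token.
            filename, label = line.rsplit(maxsplit=1)
            classification[filename] = label
    return classification
```

Splitting from the right keeps the label as the last token even if a filename happened to contain a space; a line with a single token would raise a ValueError, which is arguably the right reaction to a malformed file.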
Create the function write_classification_to_file(cls_dict, fpath) inside the utils.py module. This function takes a dictionary with email classifications and writes it to a file in a pre-defined format. cls_dict is a dictionary of email filenames and their classifications, with exactly the same structure as the output of read_classification_from_file(); fpath is a string with the path of the file to be created (typically !prediction.txt). This function is essentially the inverse of read_classification_from_file(). For more information see Spam filter - step 1.
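The writing direction can be sketched in a few lines. As with reading, the exact file format is an assumption here (one `<filename> <label>` pair per line, separated by a single space); the authoritative description is in Spam filter - step 1.

```python
def write_classification_to_file(cls_dict, fpath):
    """Write email classifications to a text file, one email per line.

    ASSUMPTION: the expected line format is "<filename> <label>".
    """
    with open(fpath, "w", encoding="utf-8") as f:
        for filename, label in cls_dict.items():
            f.write(f"{filename} {label}\n")
```

Opening the file in "w" mode overwrites an existing prediction file, which seems appropriate when the filter is re-run on the same corpus.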
Create the module quality.py.
Create the function compute_confusion_matrix(truth_dict, pred_dict, pos_tag=True, neg_tag=False) inside the quality.py module. This function receives a dictionary truth_dict with the ground-truth classification (emails manually labeled spam/ok) and a dictionary pred_dict with the classification "guessed" by the spam filter. A spam filter is rarely 100% correct when predicting which emails are spam and which are not. This function should therefore compare the spam filter's estimate with the ground truth and produce four counts:
tp (true positives): positive cases correctly predicted as positive, e.g. spam classified as spam.
tn (true negatives): negative cases correctly predicted as negative, e.g. a legitimate message classified as OK.
fp (false positives): negative cases wrongly predicted as positive, e.g. a legitimate message classified as spam.
fn (false negatives): positive cases wrongly predicted as negative, e.g. spam classified as OK.
The two extra parameters pos_tag and neg_tag specify how positive and negative cases are coded in the input dictionaries. Typically pos_tag = "SPAM" and neg_tag = "OK". The output of this function is a namedtuple containing tp, tn, fn, fp. For more information and a few test cases see Spam filter - step 2.
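The counting logic can be sketched as below. The namedtuple type name `ConfMat` and its field order are assumptions made for illustration (the text only says the result contains tp, tn, fn, fp), and the sketch assumes that every email in truth_dict also appears in pred_dict.

```python
from collections import namedtuple

# Hypothetical type name and field order; the fields are accessed by name below.
ConfMat = namedtuple("ConfMat", "tp tn fp fn")


def compute_confusion_matrix(truth_dict, pred_dict, pos_tag=True, neg_tag=False):
    """Count true/false positives/negatives of pred_dict against truth_dict."""
    tp = tn = fp = fn = 0
    for filename, truth in truth_dict.items():
        pred = pred_dict[filename]  # KeyError if a prediction is missing
        if truth == pos_tag:
            if pred == pos_tag:
                tp += 1
            elif pred == neg_tag:
                fn += 1
        elif truth == neg_tag:
            if pred == neg_tag:
                tn += 1
            elif pred == pos_tag:
                fp += 1
        # Labels equal to neither tag are silently ignored in this sketch.
    return ConfMat(tp=tp, tn=tn, fp=fp, fn=fn)


# Usage: one email in each of the four categories.
truth = {"a": "SPAM", "b": "OK", "c": "SPAM", "d": "OK"}
pred = {"a": "SPAM", "b": "SPAM", "c": "OK", "d": "OK"}
cm = compute_confusion_matrix(truth, pred, pos_tag="SPAM", neg_tag="OK")
```

Reading the result by field name (`cm.tp`, `cm.fp`, ...) rather than by position keeps calling code correct regardless of the field order chosen for the namedtuple.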
Create the function quality_score(tp, tn, fp, fn) in the quality.py module. It receives four integers tp, tn, fp, fn (described above) and returns a single number, the prediction quality measure, defined by the following formula: $q = \frac{TP + TN}{TP + TN + 10 \cdot FP + FN}$. Note: false positives (a legitimate message classified as spam) are multiplied by a factor of 10. That is, 1 FP costs as much as 10 FN. Keep that in mind when implementing your own spam filter.
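The formula translates directly into code. The sketch below also compares two made-up filters with the same number of errors, to illustrate how strongly false positives are penalised; the numbers are invented for the example. Note that the sketch does not handle the case where all four counts are zero, which would raise a ZeroDivisionError.

```python
def quality_score(tp, tn, fp, fn):
    """Prediction quality q = (TP + TN) / (TP + TN + 10*FP + FN)."""
    return (tp + tn) / (tp + tn + 10 * fp + fn)


# Both hypothetical filters make 10 mistakes on 100 emails.
few_fp = quality_score(tp=40, tn=50, fp=1, fn=9)   # 90 / 109, roughly 0.826
many_fp = quality_score(tp=40, tn=50, fp=9, fn=1)  # 90 / 181, roughly 0.497
```

The filter that lets more spam through but rarely flags legitimate mail scores much higher, which is the behaviour the factor of 10 is meant to encourage.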
Create the function compute_quality_for_corpus(corpus_dir) in the quality.py module. This function receives the path to a directory (corpus_dir) in which the files !truth.txt and !prediction.txt are expected. Use the functions from the previous steps to read these two files and compute the prediction quality measure. More information here. See the following diagram for the recommended structure:
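Independently of the diagram, the way the previous steps compose can be sketched in code. To keep the example runnable on its own, it contains compact stand-ins for the helper functions; in the actual project these would live in utils.py and quality.py. The file format (`<filename> <label>` per line) and the namedtuple name `ConfMat` are assumptions.

```python
import os
from collections import namedtuple

ConfMat = namedtuple("ConfMat", "tp tn fp fn")  # hypothetical name


def read_classification_from_file(fpath):
    # Stand-in; ASSUMES lines of the form "<filename> <label>".
    with open(fpath, "r", encoding="utf-8") as f:
        return dict(line.rsplit(maxsplit=1) for line in f if line.strip())


def compute_confusion_matrix(truth_dict, pred_dict, pos_tag=True, neg_tag=False):
    # Stand-in; counts the four truth/prediction combinations.
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    keys = {(pos_tag, pos_tag): "tp", (neg_tag, neg_tag): "tn",
            (neg_tag, pos_tag): "fp", (pos_tag, neg_tag): "fn"}
    for filename, truth in truth_dict.items():
        key = keys.get((truth, pred_dict[filename]))
        if key is not None:
            counts[key] += 1
    return ConfMat(**counts)


def quality_score(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + 10 * fp + fn)


def compute_quality_for_corpus(corpus_dir):
    """Read !truth.txt and !prediction.txt from corpus_dir and score them."""
    truth = read_classification_from_file(os.path.join(corpus_dir, "!truth.txt"))
    pred = read_classification_from_file(os.path.join(corpus_dir, "!prediction.txt"))
    cm = compute_confusion_matrix(truth, pred, pos_tag="SPAM", neg_tag="OK")
    return quality_score(cm.tp, cm.tn, cm.fp, cm.fn)
```

The function itself only wires the pieces together: read both files, build the confusion matrix with "SPAM" as the positive tag, and pass the four counts to the quality formula.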