Computer Lab 09, Spam Filter I

Q/A
Intro to spam filter
Practical exercises

Spam filter task - introduction

Read the problem definition, introduction, specifications
Data and their format
Spam filter: step 1

Practical work

Statistics for numbers in a file

Assume we have a text file (e.g. numbers.txt) containing integers separated by spaces:

1 2 1 3 1 4

In module filestats.py, create function compute_file_statistics() that takes a path to a text file as its argument, reads in all the numbers, and returns a named tuple Statistics with fields mean, median, min, max. The Statistics named tuple shall be defined as:

from collections import namedtuple

Statistics = namedtuple('Statistics', 'mean median min max')

Suggestions:

You should implement another function, e.g. compute_statistics(), that accepts a list of numbers as input and produces the required data structure. Then the main function only needs to read the data in and pass them to this function.
Note that for a collection with an even number of items, the median is defined as the average of the 2 middle items (when the collection is sorted).

Usage example

>>> from filestats import compute_file_statistics
>>> compute_file_statistics('numbers.txt')
Statistics(mean=2.0, median=1.5, min=1, max=4)
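
The suggestions above can be sketched as follows. This is only one possible solution, not the reference one. It assumes that the file is non-empty, that it contains only whitespace-separated integers, and that it is encoded in UTF-8 (the encoding is an assumption, the task does not specify it).

```python
from collections import namedtuple

Statistics = namedtuple('Statistics', 'mean median min max')


def compute_statistics(numbers):
    """Return Statistics for a non-empty list of numbers."""
    ordered = sorted(numbers)
    count = len(ordered)
    middle = count // 2
    if count % 2 == 1:
        # Odd number of items: the median is the single middle item.
        median = ordered[middle]
    else:
        # Even number of items: average of the 2 middle items.
        median = (ordered[middle - 1] + ordered[middle]) / 2
    return Statistics(
        mean=sum(ordered) / count,
        median=median,
        min=ordered[0],
        max=ordered[-1],
    )


def compute_file_statistics(fpath):
    """Read whitespace-separated integers from fpath and return Statistics."""
    # Assumption: UTF-8 encoded file containing only integers.
    with open(fpath, 'r', encoding='utf-8') as f:
        numbers = [int(token) for token in f.read().split()]
    return compute_statistics(numbers)
```

Keeping the computation in compute_statistics() makes it possible to test it on plain lists without creating any file.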

Countries and capitals

Assume we have a text file, e.g. capitals.csv (the .csv extension stands for “comma-separated values”), containing a pair of strings on each line. The first string is the name of a country, the second string is the name of its capital:

Czech Republic,Prague
USA,Washington
Germany,Berlin
Russia,Moscow

In module geography.py, create function load_capitals() that takes a path to a file containing countries and their capitals as an argument, reads the file into a dictionary with countries as keys and capitals as values, and returns that dictionary.

Usage example

>>> from geography import load_capitals
>>> capitals = load_capitals('capitals.csv')
>>> print(capitals)
{'Czech Republic': 'Prague', 'USA': 'Washington', 'Germany': 'Berlin', 'Russia': 'Moscow'}

The order of the individual key-value pairs may be different.
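
A minimal sketch of such a function might look like this. It assumes that each non-empty line contains exactly one country and one capital separated by the first comma, that no field is quoted, and that the file is encoded in UTF-8. None of these details is fixed by the task, so treat them as assumptions.

```python
def load_capitals(fpath):
    """Return a dict mapping country names to their capitals."""
    capitals = {}
    # Assumption: UTF-8 encoded file, one "country,capital" pair per line.
    with open(fpath, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # Skip blank lines, e.g. at the end of the file.
            # Split on the first comma only.
            country, capital = line.split(',', 1)
            capitals[country] = capital
    return capitals
```

For files with quoted fields or commas inside names, the csv module from the standard library would be a more robust choice.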

Collection with unique elements?

In module utils.py, create function all_elements_unique() that takes a collection as input and checks whether all of its items are unique, i.e. whether no item occurs more than once.

Usage example

>>> from utils import all_elements_unique
>>> all_elements_unique('abcdef')
True
>>> all_elements_unique([1, 2, 7])
True
>>> all_elements_unique('abracadabra')
False
>>> all_elements_unique([1, 1, 2, 7])
False
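
One possible sketch compares the number of items with the number of distinct items. It assumes that all items are hashable (true for the strings and integers in the examples), which is not guaranteed by the task.

```python
def all_elements_unique(collection):
    """Return True if no item occurs more than once in the collection."""
    # Materialize the input so that it is iterated only once.
    items = list(collection)
    # A set drops duplicates, so the sizes match only if all items differ.
    # Assumption: the items are hashable.
    return len(set(items)) == len(items)
```

With this approach, an empty collection is reported as having all items unique.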

Unique words

In module texttools.py, create function get_unique_words(fpath1, fpath2) which takes paths to 2 files and returns a 2-tuple:

set of words which were found in the first file but not in the second, and
set of words found in the second file, but not in the first one.

Usage example

Given e.g. the following files text1.txt and text2.txt, respectively:

When I was one,
I had just begun.

When I was two,
I was nearly new.

the result of executing the function may look like this:

>>> from texttools import get_unique_words
>>> first, second = get_unique_words('text1.txt', 'text2.txt')
>>> print(first)
{'one', 'had', 'just', 'begun'}
>>> print(second)
{'two', 'nearly', 'new'}

Again, the order of individual words in the printouts of the resulting sets may differ.
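
The task maps naturally onto set differences. The sketch below is one possible solution under several assumptions that the task leaves open: words are separated by whitespace, leading and trailing punctuation is stripped, the comparison is case-sensitive, and the files are encoded in UTF-8. The helper load_words() is a hypothetical name introduced only for this illustration.

```python
import string


def load_words(fpath):
    """Return the set of words found in the file at fpath."""
    # Assumption: UTF-8 encoded file, words separated by whitespace.
    with open(fpath, 'r', encoding='utf-8') as f:
        text = f.read()
    # Strip punctuation from both ends of each token, e.g. 'one,' -> 'one'.
    words = {token.strip(string.punctuation) for token in text.split()}
    # Tokens made of punctuation only become empty strings; drop them.
    words.discard('')
    return words


def get_unique_words(fpath1, fpath2):
    """Return (words only in the first file, words only in the second)."""
    words1 = load_words(fpath1)
    words2 = load_words(fpath2)
    return words1 - words2, words2 - words1
```

If words differing only in letter case should be treated as equal, the tokens could be converted with lower() before they are stored in the set.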

Homework

Solve the homework: working with files. See the deadline in UploadSystem.

And:

Implement step 1 of Spam filter task.
Prepare for step 2.
