Spam filter - final step

Implement you own filter, as good as possible.

Preparation

Think about the inner mechanisms of your filter. What principles shall your filter use to decide between HAM and SPAM for any particular email? How could such a filter be trained using a training dataset?

Read through the following section with several tips that may simplify your work on the filter.

Programming tips

Now we would like to provide a few useful programming tips. You do not need to use anything presented in this section. But if you want…

Tokenization

If you decide to build a filter that will identify individual words or other important phrases in message body, you will probably need to do some form of text tokenization.

You should definitely read through the string methods documentation to learn what operations can be done with them. We would like to especially note the following:

str.lower() and str.upper(), if you do not want to distinguish between, e.g., 'RESULT', 'Result' and 'result',
str.lstrip(), str.rstrip(), str.strip(), if you want to get rid of the spaces from the beginning and the end of a string,
str.startswith(), str.endswith(), if you want to check a substring at the beginning and at the end of a string,
short_str in long_str, if you want to check if one string is contained in another string,
str.find(), if you want to find the precise position of one string inside another,
str.replace(), if you want to ignore some substrings and replace them by e.g. white space or other string,,
str.translate(), if you want to replace some character in a string with some other characters, and finally
str.split(), if you want to split a long string into a list of shorter strings using a specified character as a delimiter.

Use the above list as a suggestion to possibly useful methods. A simple tokenizer can look e.g. like this:

shortphrase = """
Lorem ipsum dolor sit amet consectetuer Aliquam orci wisi pretium Praesent. Lorem pellentesque ac ipsum Aenean Sed pretium Pellentesque adipiscing enim at. Nisl tempus tempor interdum dictumst wisi consequat id at eu platea. Volutpat adipiscing Curabitur pede Aenean interdum Nulla congue nibh pharetra Maecenas. Convallis aliquam leo leo nisl mi cursus nisl enim Nam leo. Duis nunc nulla leo augue nunc pretium sit Aenean Nam.
"""
# Replace dots with spaces
shortphrase = shortphrase.translate(str.maketrans('.',' '))
# Transform the letters to lowercase and cut the string into tokens.
tokens = shortphrase.lower().split()
print(tokens)

['lorem', 'ipsum', 'dolor', 'sit', 'amet', 'consectetuer', 'aliquam', 'orci', 'wisi', 'pretium', 'praesent', 'lorem', 'pellentesque', 'ac', 'ipsum', 'aenean', 'sed', 'pretium', 'pellentesque', 'adipiscing', 'enim', 'at', 'nisl', 'tempus', 'tempor', 'interdum', 'dictumst', 'wisi', 'consequat', 'id', 'at', 'eu', 'platea', 'volutpat', 'adipiscing', 'curabitur', 'pede', 'aenean', 'interdum', 'nulla', 'congue', 'nibh', 'pharetra', 'maecenas', 'convallis', 'aliquam', 'leo', 'leo', 'nisl', 'mi', 'cursus', 'nisl', 'enim', 'nam', 'leo', 'duis', 'nunc', 'nulla', 'leo', 'augue', 'nunc', 'pretium', 'sit', 'aenean', 'nam']
>>>

Counters

In one of our labs, we demonstrated how a dictionary can be used as a counter. Because this dictionary usecase is quite often, Python contains class collections.Counter which is derived from dictionary (thus it has the same methods and can be used the same way), but it has a modified behavior.

counter['non-existing key'] returns 0 and does not produce KeyError. (Thus we do not need to use method get().)

In comparison to the regular dictionary, it also has a few additional methods:

Counter.update()
Counter.most_common()
overloaded addition and subtraction - it is thus possible to add or subtract two Counter instances.

If we shall continue with the above example, the following commands

counter = Counter(tokens)
print(counter)
print()
print(counter.most_common(10))

will produce this output:

Counter({'leo': 4, 'pretium': 3, 'aenean': 3, 'nisl': 3, 'at': 2, 'enim': 2, 'nunc': 2, 'interdum': 2, 'nam': 2, 'pellentesque': 2, 'lorem': 2, 'aliquam': 2, 'ipsum': 2, 'wisi': 2, 'sit': 2, 'adipiscing': 2, 'nulla': 2, 'pharetra': 1, 'ac': 1, 'pede': 1, 'duis': 1, 'sed': 1, 'eu': 1, 'id': 1, 'praesent': 1, 'volutpat': 1, 'platea': 1, 'convallis': 1, 'congue': 1, 'dolor': 1, 'tempus': 1, 'nibh': 1, 'consequat': 1, 'cursus': 1, 'curabitur': 1, 'maecenas': 1, 'amet': 1, 'dictumst': 1, 'mi': 1, 'augue': 1, 'tempor': 1, 'orci': 1, 'consectetuer': 1})

[('leo', 4), ('pretium', 3), ('aenean', 3), ('nisl', 3), ('at', 2), ('enim', 2), ('nunc', 2), ('interdum', 2), ('nam', 2), ('pellentesque', 2)]

Walking through the items of a dictionary (counter) in certain order

Very often we want to access the dictionary (counter) items in an order according to their frequency. In Python, you will probably use function sorted() to get a sorted list. Let's try what the next commands do:

counter2 = Counter(counter.most_common(5))
sorted(counter2)

Result:

[('aenean', 3), ('at', 2), ('leo', 4), ('nisl', 3), ('pretium', 3)]

The results is a list of pairs (token, token frequency) sorted alphabetically according to the token. This is not the desired behavior.

We want the list sorted by the frequencies. We have to let the sorted() function know the key we want to use for ordering. This can be done e.g. using a lambda function. (At this moment you can simply take it as a recipe how to sort the dictionary items according the values, not the keys.) Moreover, we shall order the sorted() function to create a descending order:

sorted(counter2, key=lambda pair: (pair[1],pair[0]), reverse=True)

We shall get the wanted result:

[('leo', 4), ('pretium', 3), ('nisl', 3), ('aenean', 3), ('at', 2)]

Advanced tokenization, regular expressions

For advanced or more curious students, let us show the way how Python allows us to work with regular expressions. These allow us to describe a textual pattern in a flexible way and subsequently to find all the parts of a string that match the pattern, or to split the string using these parts. You can use regular expressions for tokenization e.g. like this:

import re
 
pattern = r"""(?im)[a-zA-Z\$]+[\w\d\$\-']*"""
compiledre = re.compile(pattern)
tokens = compiledre.findall(message)
counter = Counter(tokens)

Email structure

If you do not want to work with the email in the form of unstructured text, you can use module modul ''email'' and its functions:

email.message_from_string(), email.message_from_file() allow to create a Message class instance from a string or file, respectively.
method Message.walk()allows you to systematically parse the email by individual parts.

Creating your own filter

Task:

Create your own working spam filter.

Why do we need it?

You will hand it in and it will form part of your evaluation.

Specifications

Take some inspiration from simple filters created in step 4. Inner working of your spam filter is entirely up to you.

To evaluate your spam filter using our automatic script, it has to be compatible with these specs.

Table of Contents