Implement you own filter, as good as possible.
Think about the inner mechanisms of your filter. What principles shall your filter use to decide between HAM and SPAM for any particular email? How could such a filter be trained using a training dataset?
Read through the following section with several tips that may simplify your work on the filter.
Now we would like to provide a few useful programming tips. You do not need to use anything presented in this section. But if you want…
If you decide to build a filter that will identify individual words or other important phrases in message body, you will probably need to do some form of text tokenization.
You should definitely read through the string methods documentation to learn what operations can be done with them. We would like to especially note the following:
short_str in long_str, if you want to check if one string is contained in another string,
Use the above list as a suggestion to possibly useful methods. A simple tokenizer can look e.g. like this:
shortphrase = """ Lorem ipsum dolor sit amet consectetuer Aliquam orci wisi pretium Praesent. Lorem pellentesque ac ipsum Aenean Sed pretium Pellentesque adipiscing enim at. Nisl tempus tempor interdum dictumst wisi consequat id at eu platea. Volutpat adipiscing Curabitur pede Aenean interdum Nulla congue nibh pharetra Maecenas. Convallis aliquam leo leo nisl mi cursus nisl enim Nam leo. Duis nunc nulla leo augue nunc pretium sit Aenean Nam. """ # Replace dots with spaces shortphrase = shortphrase.translate(str.maketrans('.',' ')) # Transform the letters to lowercase and cut the string into tokens. tokens = shortphrase.lower().split() print(tokens)
['lorem', 'ipsum', 'dolor', 'sit', 'amet', 'consectetuer', 'aliquam', 'orci', 'wisi', 'pretium', 'praesent', 'lorem', 'pellentesque', 'ac', 'ipsum', 'aenean', 'sed', 'pretium', 'pellentesque', 'adipiscing', 'enim', 'at', 'nisl', 'tempus', 'tempor', 'interdum', 'dictumst', 'wisi', 'consequat', 'id', 'at', 'eu', 'platea', 'volutpat', 'adipiscing', 'curabitur', 'pede', 'aenean', 'interdum', 'nulla', 'congue', 'nibh', 'pharetra', 'maecenas', 'convallis', 'aliquam', 'leo', 'leo', 'nisl', 'mi', 'cursus', 'nisl', 'enim', 'nam', 'leo', 'duis', 'nunc', 'nulla', 'leo', 'augue', 'nunc', 'pretium', 'sit', 'aenean', 'nam'] >>>
In one of our labs, we demonstrated how a dictionary can be used as a counter. Because this dictionary usecase is quite often, Python contains class collections.Counter which is derived from dictionary (thus it has the same methods and can be used the same way), but it has a modified behavior.
counter['non-existing key'] returns 0 and does not produce KeyError. (Thus we do not need to use method get().)
In comparison to the regular dictionary, it also has a few additional methods:
Counter instances.
If we shall continue with the above example, the following commands
counter = Counter(tokens) print(counter) print() print(counter.most_common(10))will produce this output:
Counter({'leo': 4, 'pretium': 3, 'aenean': 3, 'nisl': 3, 'at': 2, 'enim': 2, 'nunc': 2, 'interdum': 2, 'nam': 2, 'pellentesque': 2, 'lorem': 2, 'aliquam': 2, 'ipsum': 2, 'wisi': 2, 'sit': 2, 'adipiscing': 2, 'nulla': 2, 'pharetra': 1, 'ac': 1, 'pede': 1, 'duis': 1, 'sed': 1, 'eu': 1, 'id': 1, 'praesent': 1, 'volutpat': 1, 'platea': 1, 'convallis': 1, 'congue': 1, 'dolor': 1, 'tempus': 1, 'nibh': 1, 'consequat': 1, 'cursus': 1, 'curabitur': 1, 'maecenas': 1, 'amet': 1, 'dictumst': 1, 'mi': 1, 'augue': 1, 'tempor': 1, 'orci': 1, 'consectetuer': 1})
[('leo', 4), ('pretium', 3), ('aenean', 3), ('nisl', 3), ('at', 2), ('enim', 2), ('nunc', 2), ('interdum', 2), ('nam', 2), ('pellentesque', 2)]
Very often we want to access the dictionary (counter) items in an order according to their frequency. In Python, you will probably use function sorted() to get a sorted list. Let's try what the next commands do:
counter2 = Counter(counter.most_common(5)) sorted(counter2)Result:
[('aenean', 3), ('at', 2), ('leo', 4), ('nisl', 3), ('pretium', 3)]
The results is a list of pairs (token, token frequency) sorted alphabetically according to the token. This is not the desired behavior.
We want the list sorted by the frequencies. We have to let the sorted() function know the key we want to use for ordering. This can be done e.g. using a lambda function. (At this moment you can simply take it as a recipe how to sort the dictionary items according the values, not the keys.) Moreover, we shall order the sorted() function to create a descending order:
sorted(counter2, key=lambda pair: (pair[1],pair[0]), reverse=True)We shall get the wanted result:
[('leo', 4), ('pretium', 3), ('nisl', 3), ('aenean', 3), ('at', 2)]
For advanced or more curious students, let us show the way how Python allows us to work with regular expressions. These allow us to describe a textual pattern in a flexible way and subsequently to find all the parts of a string that match the pattern, or to split the string using these parts. You can use regular expressions for tokenization e.g. like this:
import re pattern = r"""(?im)[a-zA-Z\$]+[\w\d\$\-']*""" compiledre = re.compile(pattern) tokens = compiledre.findall(message) counter = Counter(tokens)
If you do not want to work with the email in the form of unstructured text, you can use module modul ''email'' and its functions:
email.message_from_string(), email.message_from_file() allow to create a Message class instance from a string or file, respectively.
Message.walk()allows you to systematically parse the email by individual parts.
Task:
Why do we need it?