Warning
 — courses:be5b33prg:homeworks:spam:step_final [2015/10/15 14:03] (current)xposik created 2015/10/15 14:03 xposik created 2015/10/15 14:03 xposik created Line 1: Line 1: + ====== Spam filter - final step ====== + Implement you own filter, as good as possible. + /** + ​ + [[.unit_testing|Tests]] for step 6: + * tests for step 6 only {{:​courses:​a4b99rph:​cviceni:​spam:​test6_filter.zip|}} or + * together with tests for all preceding steps {{:​courses:​a4b99rph:​cviceni:​spam:​test6_all.zip|}}. + ​ + **/ + + ===== Preparation ===== + Think about the inner mechanisms of your filter. What principles shall your filter use to decide between HAM and SPAM for any particular email? How could such a filter be trained using a training dataset? + + Read through the following section with several tips that may simplify your work on the filter. + + ===== Programming tips ===== + Now we would like to provide a few useful programming tips. You do not need to use anything presented in this section. But if you want... + + ==== Tokenization ==== + If you decide to build a filter that will identify individual words or other important phrases in message body, you will probably need to do some form of text [[wp>​tokenization]]. + + You should definitely read through the [[http://​docs.python.org/​release/​3.2.3/​library/​stdtypes.html#​string-methods|string methods documentation]] to learn what operations can be done with them. We would like to especially note the following: + + * [[http://​docs.python.org/​release/​3.1.3/​library/​stdtypes.html#​str.lower|str.lower()]] and [[http://​docs.python.org/​release/​3.1.3/​library/​stdtypes.html#​str.lower|str.upper()]],​ if you do not want to distinguish between, e.g., '​RESULT',​ '​Result'​ and '​result',​ + * [[http://​docs.python.org/​release/​3.1.3/​library/​stdtypes.html#​str.lstrip|str.lstrip()]],​ [[http://​docs.python.org/​release/​3.1.3/​library/​stdtypes.html#​str.rstrip|str.rstrip()]],​ [[http://​docs.python.org/​release/​3.1.3/​library/​stdtypes.html#​str.strip|str.strip()]],​ if you want to get rid of the spaces from the beginning and the end of a string, + * [[http://​docs.python.org/​release/​3.1.3/​library/​stdtypes.html#​str.startswith|str.startswith()]],​ [[http://​docs.python.org/​release/​3.1.3/​library/​stdtypes.html#​str.endswith|str.endswith()]],​ if you want to check a substring at the beginning and at the end of a string, + * ''​short_str in long_str'',​ if you want to check if one string is contained in another string, + * [[http://​docs.python.org/​release/​3.1.3/​library/​stdtypes.html#​str.find|str.find()]],​ if you want to find the precise position of one string inside another, + * [[http://​docs.python.org/​release/​3.1.3/​library/​stdtypes.html#​str.replace|str.replace()]],​ if you want to ignore some substrings and replace them by e.g. white space or other string,, + * [[http://​docs.python.org/​release/​3.1.3/​library/​stdtypes.html#​str.translate|str.translate()]],​ if you want to replace some character in a string with some other characters, and finally + * [[http://​docs.python.org/​release/​3.1.3/​library/​stdtypes.html#​str.split|str.split()]],​ if you want to split a long string into a list of shorter strings using a specified character as a delimiter. + + Use the above list as a suggestion to possibly useful methods. A simple tokenizer can look e.g. like this: + + + shortphrase = """​ + Lorem ipsum dolor sit amet consectetuer Aliquam orci wisi pretium Praesent. Lorem pellentesque ac ipsum Aenean Sed pretium Pellentesque adipiscing enim at. Nisl tempus tempor interdum dictumst wisi consequat id at eu platea. Volutpat adipiscing Curabitur pede Aenean interdum Nulla congue nibh pharetra Maecenas. Convallis aliquam leo leo nisl mi cursus nisl enim Nam leo. Duis nunc nulla leo augue nunc pretium sit Aenean Nam. + """​ + # Replace dots with spaces + shortphrase = shortphrase.translate(str.maketrans('​.','​ ')) + # Transform the letters to lowercase and cut the string into tokens. + tokens = shortphrase.lower().split() + print(tokens) + ​ + + <​code>​ + ['​lorem',​ '​ipsum',​ '​dolor',​ '​sit',​ '​amet',​ '​consectetuer',​ '​aliquam',​ '​orci',​ '​wisi',​ '​pretium',​ '​praesent',​ '​lorem',​ '​pellentesque',​ '​ac',​ '​ipsum',​ '​aenean',​ '​sed',​ '​pretium',​ '​pellentesque',​ '​adipiscing',​ '​enim',​ '​at',​ '​nisl',​ '​tempus',​ '​tempor',​ '​interdum',​ '​dictumst',​ '​wisi',​ '​consequat',​ '​id',​ '​at',​ '​eu',​ '​platea',​ '​volutpat',​ '​adipiscing',​ '​curabitur',​ '​pede',​ '​aenean',​ '​interdum',​ '​nulla',​ '​congue',​ '​nibh',​ '​pharetra',​ '​maecenas',​ '​convallis',​ '​aliquam',​ '​leo',​ '​leo',​ '​nisl',​ '​mi',​ '​cursus',​ '​nisl',​ '​enim',​ '​nam',​ '​leo',​ '​duis',​ '​nunc',​ '​nulla',​ '​leo',​ '​augue',​ '​nunc',​ '​pretium',​ '​sit',​ '​aenean',​ '​nam'​] + >>> ​ + ​ + + + + ==== Counters ==== + In one of our labs, we demonstrated how a dictionary can be used as a counter. Because this dictionary usecase is quite often, Python contains class ''​[[http://​docs.python.org/​release/​3.2.3/​library/​collections.html?​highlight=collections.counter#​collections.Counter|collections.Counter]]''​ which is derived from dictionary (thus it has the same methods and can be used the same way), but it has a modified behavior. + + * ''​counter['​non-existing key'​]''​ returns 0 and does not produce ''​KeyError''​. (Thus we do not need to use method ''​get()''​.) + + In comparison to the regular dictionary, it also has a few additional methods: + * [[http://​docs.python.org/​release/​3.2.3/​library/​collections.html?​highlight=collections.counter#​collections.Counter.update|Counter.update()]] + * [[http://​docs.python.org/​release/​3.2.3/​library/​collections.html?​highlight=collections.counter#​collections.Counter.most_common|Counter.most_common()]] + * overloaded addition and subtraction - it is thus possible to add or subtract two ''​Counter''​ instances. + + If we shall continue with the above example, the following commands + + counter = Counter(tokens) + print(counter) + print() + print(counter.most_common(10)) + ​ + will produce this output: + <​code>​ + Counter({'​leo':​ 4, '​pretium':​ 3, '​aenean':​ 3, '​nisl':​ 3, '​at':​ 2, '​enim':​ 2, '​nunc':​ 2, '​interdum':​ 2, '​nam':​ 2, '​pellentesque':​ 2, '​lorem':​ 2, '​aliquam':​ 2, '​ipsum':​ 2, '​wisi':​ 2, '​sit':​ 2, '​adipiscing':​ 2, '​nulla':​ 2, '​pharetra':​ 1, '​ac':​ 1, '​pede':​ 1, '​duis':​ 1, '​sed':​ 1, '​eu':​ 1, '​id':​ 1, '​praesent':​ 1, '​volutpat':​ 1, '​platea':​ 1, '​convallis':​ 1, '​congue':​ 1, '​dolor':​ 1, '​tempus':​ 1, '​nibh':​ 1, '​consequat':​ 1, '​cursus':​ 1, '​curabitur':​ 1, '​maecenas':​ 1, '​amet':​ 1, '​dictumst':​ 1, '​mi':​ 1, '​augue':​ 1, '​tempor':​ 1, '​orci':​ 1, '​consectetuer':​ 1}) + + [('​leo',​ 4), ('​pretium',​ 3), ('​aenean',​ 3), ('​nisl',​ 3), ('​at',​ 2), ('​enim',​ 2), ('​nunc',​ 2), ('​interdum',​ 2), ('​nam',​ 2), ('​pellentesque',​ 2)] + ​ + + + ==== Walking through the items of a dictionary (counter) in certain order ==== + Very often we want to access the dictionary (counter) items in an order according to their frequency. In Python, you will probably use function ''​[[http://​docs.python.org/​release/​3.2.3/​library/​functions.html?​highlight=sort#​sorted|sorted()]]''​ to get a sorted list. Let's try what the next commands do: + + counter2 = Counter(counter.most_common(5)) + sorted(counter2) + ​ + Result: + <​code>​ + [('​aenean',​ 3), ('​at',​ 2), ('​leo',​ 4), ('​nisl',​ 3), ('​pretium',​ 3)] + ​ + The results is a list of pairs (token, token frequency) sorted alphabetically according to the token. This is not the desired behavior. + + We want the list sorted by the frequencies. We have to let the ''​sorted()''​ function know the key we want to use for ordering. This can be done e.g. using a lambda function. (At this moment you can simply take it as a recipe how to sort the dictionary items according the values, not the keys.) Moreover, we shall order the ''​sorted()''​ function to create a descending order: + + sorted(counter2,​ key=lambda pair: (pair[1],​pair[0]),​ reverse=True) + ​ + We shall get the wanted result: + <​code>​ + [('​leo',​ 4), ('​pretium',​ 3), ('​nisl',​ 3), ('​aenean',​ 3), ('​at',​ 2)] + ​ + + + ==== Advanced tokenization,​ regular expressions ==== + For advanced or more curious students, let us show the way how Python allows us to work with [[http://​docs.python.org/​release/​3.2.3/​library/​re.html?​highlight=re#​module-re|regular expressions]]. These allow us to describe a textual pattern in a flexible way and subsequently to find all the parts of a string that match the pattern, or to split the string using these parts. You can use regular expressions for tokenization e.g. like this: + + import re + + pattern = r"""​(?​im)[a-zA-Z\$]+[\w\d\$\-'​]*"""​ + compiledre = re.compile(pattern) + tokens = compiledre.findall(message) + counter = Counter(tokens) + ​ + + + ==== Email structure ==== + If you do not want to work with the email in the form of unstructured text, you can use module [[http://​docs.python.org/​release/​3.2.3/​library/​email.html?​highlight=email#​module-email|modul ''​email''​]] and its functions: + * ''​[[http://​docs.python.org/​release/​3.2.3/​library/​email.parser.html#​email.message_from_string|email.message_from_string()]]'',​ ''​[[http://​docs.python.org/​release/​3.2.3/​library/​email.parser.html#​email.message_from_file|email.message_from_file()]]''​ allow to create a Message class instance from a string or file, respectively. + * method ''​[[http://​docs.python.org/​release/​3.2.3/​library/​email.message.html#​email.message.Message.walk|Message.walk()]]''​allows you to systematically parse the email by individual parts. + + + ===== Creating your own filter ===== + Task: + * Create your own working spam filter. ​ + + Why do we need it? + * You will hand it in and it will form part of your evaluation. + + ====Specifications==== + Take some inspiration from simple filters created in [[.:​step4|step 4]]. Inner working of your spam filter is entirely up to you. + + To evaluate your spam filter using our automatic script, it has to be compatible with these [[.:​specifications|specs]].