
Spam filter

Spam filtering is a practical assignment with wide real-world application. It also represents a certain class of problems that we have to contend with in machine learning.

The problem

In this assignment, your main task is not to create a perfect spam filter. You do not yet know the methods that would allow you to do that. Your task is:

  • To understand the problem, analyze the assignment and decompose it.
  • To create a set of functions and classes in Python that will help you use a spam filter (once you create one) and evaluate its quality (for example, to compare two spam filters).
  • To create a simple (even trivial) spam filter that can be used in such a framework.
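
The tasks above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the required design: the label names `SPAM` and `OK`, the class names, the `predict` method, and the representation of the data as dictionaries mapping a file name to its text or label are all hypothetical. The real data format is described separately.

```python
SPAM, OK = "SPAM", "OK"  # hypothetical label names (assumption)


class AlwaysOkFilter:
    """Trivial filter: marks every message as legitimate."""

    def predict(self, text):
        return OK


class KeywordFilter:
    """Trivial filter: marks a message as spam if it contains a trigger word."""

    def __init__(self, keywords=("viagra", "lottery")):
        self.keywords = tuple(k.lower() for k in keywords)

    def predict(self, text):
        lowered = text.lower()
        return SPAM if any(k in lowered for k in self.keywords) else OK


def accuracy(spam_filter, messages, truth):
    """Fraction of messages whose predicted label matches the true label."""
    correct = sum(
        spam_filter.predict(text) == truth[name]
        for name, text in messages.items()
    )
    return correct / len(messages)


# Tiny made-up corpus, for illustration only.
messages = {
    "a.txt": "You won the lottery!",
    "b.txt": "Meeting at noon",
    "c.txt": "Cheap pills here",
}
truth = {"a.txt": SPAM, "b.txt": OK, "c.txt": SPAM}

score_always_ok = accuracy(AlwaysOkFilter(), messages, truth)
score_keyword = accuracy(KeywordFilter(), messages, truth)
print(score_always_ok, score_keyword)
```

Because both filters expose the same `predict` method, the evaluation function does not need to know which filter it is given. That is the point of building the framework first: any later, smarter filter can be plugged in and compared in the same way.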

What will you learn?

  • You will see the basic principles of spam filtering in action.
  • You will touch on the field of data mining (or rather, text mining).
  • You will see how Python can be employed for machine processing of textual information.
  • You will have another opportunity to practice Python.


With this assignment, we want to show the following:

  1. For certain problem classes, the program's ability to adapt itself is essential.
  2. Automatic learning also has certain pitfalls that need to be avoided.
  3. There are tasks for which it is hard to judge the quality of a solution.
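
The last point can be illustrated with a small sketch. The data below is made up, and the label names `SPAM` and `OK` as well as the function names are assumptions used only for this example. Two very different sets of predictions reach the same accuracy, so a single number does not say which filter is better.

```python
def confusion_counts(truth, predicted, positive="SPAM"):
    """Count true/false positives and negatives, treating `positive` as spam."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for name, true_label in truth.items():
        pred_is_spam = predicted[name] == positive
        true_is_spam = true_label == positive
        if pred_is_spam and true_is_spam:
            counts["tp"] += 1  # spam correctly caught
        elif not pred_is_spam and not true_is_spam:
            counts["tn"] += 1  # legitimate mail correctly passed
        elif pred_is_spam:
            counts["fp"] += 1  # legitimate mail wrongly flagged
        else:
            counts["fn"] += 1  # spam that slipped through
    return counts


def accuracy_of(counts):
    """Share of correct decisions among all decisions."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


# Made-up, imbalanced data: nine legitimate mails and one spam.
truth = {f"mail{i}": "OK" for i in range(9)}
truth["mail9"] = "SPAM"

# "Lazy" predictions never flag anything.
lazy = {name: "OK" for name in truth}
# "Eager" predictions catch the spam but also flag one legitimate mail.
eager = dict(lazy, mail0="SPAM", mail9="SPAM")

lazy_counts = confusion_counts(truth, lazy)
eager_counts = confusion_counts(truth, eager)
print(lazy_counts, accuracy_of(lazy_counts))
print(eager_counts, accuracy_of(eager_counts))
```

Both prediction sets make exactly one mistake out of ten, yet the first lets all spam through and the second loses a legitimate message. Which error is worse depends on how the two kinds of mistakes are weighted, and that choice is part of what makes the quality hard to judge.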


We provide you with two sets of data to work with. Although the final evaluation of your work will be done using a different set of data, your spam filter should work on both of the provided sets. It is also important that you understand the format of the data that we will use; it is described on the page linked above.

courses/be5b33prg/homeworks/spam/start.txt · Last modified: 2018/08/13 09:48 (external edit)