Warning: This page is located in the archive. Go to the latest version of the course pages.
Differences
This shows you the differences between two versions of the page.
courses:be5b33prg:homeworks:spam:data [2015/11/24 16:04] xposik
courses:be5b33prg:homeworks:spam:data [2015/11/24 16:12] xposik
Line 1:
======Data format======
- During this assignment you will work with data sets of emails, which will also contain meta-data about the emails. Such a set of data is usually called [[wp>Text_corpus|corpus]]. In our case, the meta-data for our emails may contain the information whether it is a spam or not and/or what the decision of the spam filter is.
+ During this assignment you will work with data sets of emails, which will also contain meta-data about the emails. Such a set of data is usually called a [[wp>Text_corpus|corpus]]. In our case, the meta-data will contain information whether a particular email is a spam or not, and/or whether a spam filter thinks that the email is spam or not.
You are given two sets of data to work with, they both come from the same source.
Line 7:
</WRAP>
</WRAP>
- So, our email corpus will be:
+ We shall use the following **convention**: our email corpus will be
-   * a folder, where every file is considered an email with the exception of
+   * a folder, where every file contains a single email message in a text form, with the exception of
-     * the //!truth.txt// file, which contains a name of a file with an email and an information about its true nature (spam or not), one file per line, and
+     * the ''!truth.txt'' file, which contains a name of a file with an email and an information about its true nature (spam or not), one file per line, and
-     * the //!prediction.txt// file, which has the same structure as //!truth.txt// and contains the spam filter prediction for the respective email message file.
+     * the ''!prediction.txt'' file, which has the same structure as ''!truth.txt'', but contains the spam filter predictions/decisions for the respective email message file.
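The corpus convention above can be sketched in Python. This is only an illustration under assumptions the text does not state: each line of a label file is taken to be an email file name and a label separated by whitespace, and the function names ''read_classification_from_file'' and ''list_email_files'' are hypothetical helpers, not part of the assignment text.

```python
from pathlib import Path

# The two special files named in the convention; everything else is an email.
SPECIAL_FILES = {"!truth.txt", "!prediction.txt"}


def read_classification_from_file(path):
    """Read a label file such as !truth.txt into a dict {file name: label}.

    Assumption (not given in the text): every non-empty line has the form
    "<email file name> <label>", separated by whitespace.
    """
    labels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            filename, label = parts
            labels[filename] = label
    return labels


def list_email_files(corpus_dir):
    """Return the sorted names of email files in the corpus folder."""
    return sorted(
        p.name
        for p in Path(corpus_dir).iterdir()
        if p.is_file() and p.name not in SPECIAL_FILES
    )
```

Both label files share one structure, so a single reader is enough for ''!truth.txt'' and ''!prediction.txt'' alike.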
Of course, these two files do not have to be present in the corpus directory:
-   - Spam filter itself does not need either of them to work and decide. However, it should be able to create //!prediction.txt// containing its predictions.
+   - Spam filter itself needs neither of them to work and decide. However, the spam filter should be able to create ''!prediction.txt'' with the correct structure containing its predictions.
-   - If you create your own spam filter (or if you want to try to use a machine learning algorithm to build the filter), you will need a //training corpus//, i.e. a corpus containing the //!truth.txt// file. (Because otherwise you would not know which emails are spam...)
+   - If you create your own spam filter (or if you want to try to use a machine learning algorithm to build the filter), you will need a //training corpus//, i.e. a corpus containing the ''!truth.txt'' file. (Because otherwise you would not know which emails are spam...)
-   - If we want to evaluate filter quality, we will need both files - //!truth.txt// and //!prediction.txt//. By comparing these files, we can tell how good predictions the filter gives.
+   - And, when we evaluate quality of a filter, we will need both files - ''!truth.txt'' and ''!prediction.txt''. By comparing these files, we can tell how good predictions the filter provides.
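Writing the predictions and comparing them with the truth might look like the sketch below. The one-pair-per-line layout with a single space as separator, the label strings ''SPAM'' and ''OK'', and both function names are assumptions made for illustration. Plain accuracy is used here as one simple quality measure; the text does not prescribe a particular one.

```python
def write_classification_to_file(path, predictions):
    """Write {file name: label} pairs, one per line, e.g. to !prediction.txt.

    Assumption: name and label are separated by a single space.
    """
    with open(path, "w", encoding="utf-8") as f:
        for filename, label in sorted(predictions.items()):
            f.write(f"{filename} {label}\n")


def compute_accuracy(truth, prediction):
    """Return the fraction of emails in `truth` whose prediction matches.

    An email missing from `prediction` counts as a wrong decision.
    """
    if not truth:
        raise ValueError("truth must contain at least one email")
    correct = sum(
        1 for name, label in truth.items() if prediction.get(name) == label
    )
    return correct / len(truth)
```

For instance, with a truth of four emails and a prediction that disagrees on exactly one of them, ''compute_accuracy'' returns 0.75.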
courses/be5b33prg/homeworks/spam/data.txt · Last modified: 2015/11/24 16:12 by xposik