Warning: This page is located in the archive. Go to the latest version of this course's pages.
Differences
This shows you the differences between two versions of the page.
2015/11/24 16:12 xposik
2015/11/24 16:06 xposik
2015/11/24 16:04 xposik
2015/11/24 15:54 xposik created
courses:be5b33prg:homeworks:spam:data [2015/11/24 16:06] xposik
courses:be5b33prg:homeworks:spam:data [2015/11/24 16:12] xposik
Line 7:
Line 7:
</WRAP>
</WRAP>
- So, our email corpus will be:
+ We shall use the following **convention**: our email corpus will be
-   * a folder, where every file is considered an email with the exception of
+   * a folder, where every file contains a single email message in a text form, with the exception of
-     * the //!truth.txt// file, which contains a name of a file with an email and an information about its true nature (spam or not), one file per line, and
+     * the ''!truth.txt'' file, which contains a name of a file with an email and an information about its true nature (spam or not), one file per line, and
-     * the //!prediction.txt// file, which has the same structure as //!truth.txt// and contains the spam filter prediction for the respective email message file.
+     * the ''!prediction.txt'' file, which has the same structure as ''!truth.txt'', but contains the spam filter predictions/decisions for the respective email message file.
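The corpus convention above can be sketched in Python as follows. This is only an illustration under stated assumptions: the page does not specify how the file name and the label are separated or what the labels look like, so the sketch assumes one whitespace-separated `name label` pair per line, and the label strings in the usage note (`SPAM`, `OK`) are hypothetical. The helper names `read_classification` and `email_files` are likewise made up for this example.

```python
import os

# Names of the two special files described by the convention.
SPECIAL_FILES = {"!truth.txt", "!prediction.txt"}


def read_classification(path):
    """Read a !truth.txt-style file into a {filename: label} dict.

    Assumption (not stated on the page): each line holds a file name
    and a label separated by whitespace, so file names must not
    contain spaces. Blank or malformed lines are skipped.
    """
    labels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue
            name, label = parts
            labels[name] = label
    return labels


def email_files(corpus_dir):
    """Return the sorted names of all email files in the corpus folder.

    Every regular file counts as an email, with the exception of the
    two special files.
    """
    return sorted(
        name
        for name in os.listdir(corpus_dir)
        if name not in SPECIAL_FILES
        and os.path.isfile(os.path.join(corpus_dir, name))
    )
```

For a folder holding two messages and a `!truth.txt` with the lines `a.eml SPAM` and `b.eml OK`, `email_files` would list the two message files and `read_classification` would map each of them to its label.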
Of course, these two files do not have to be present in the corpus directory:
-   - Spam filter itself does not need either of them to work and decide. However, it should be able to create //!prediction.txt// containing its predictions.
+   - Spam filter itself needs neither of them to work and decide. However, the spam filter should be able to create ''!prediction.txt'' with the correct structure containing its predictions.
-   - If you create your own spam filter (or if you want to try to use a machine learning algorithm to build the filter), you will need a //training corpus//, i.e. a corpus containing the //!truth.txt// file. (Because otherwise you would not know which emails are spam...)
+   - If you create your own spam filter (or if you want to try to use a machine learning algorithm to build the filter), you will need a //training corpus//, i.e. a corpus containing the ''!truth.txt'' file. (Because otherwise you would not know which emails are spam...)
-   - If we want to evaluate filter quality, we will need both files - //!truth.txt// and //!prediction.txt//. By comparing these files, we can tell how good predictions the filter gives.
+   - And, when we evaluate quality of a filter, we will need both files - ''!truth.txt'' and ''!prediction.txt''. By comparing these files, we can tell how good predictions the filter provides.
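The last point, comparing `!truth.txt` with `!prediction.txt`, could look roughly like the sketch below. It rests on assumptions the page does not confirm: whitespace-separated `name label` lines, and plain accuracy (the fraction of emails whose predicted label matches the true one) as the quality measure; the page does not name a measure. The function names `write_predictions`, `read_labels` and `accuracy` are hypothetical.

```python
import os


def write_predictions(corpus_dir, predictions):
    """Write a {filename: label} dict to !prediction.txt, one file per line.

    Assumption: name and label are separated by a single space.
    """
    path = os.path.join(corpus_dir, "!prediction.txt")
    with open(path, "w", encoding="utf-8") as f:
        for name, label in sorted(predictions.items()):
            f.write(f"{name} {label}\n")
    return path


def read_labels(path):
    """Read 'name label' lines into a dict, ignoring blank lines.

    Assumes every non-blank line has exactly two whitespace-separated
    fields; other lines raise ValueError.
    """
    with open(path, encoding="utf-8") as f:
        return dict(line.split() for line in f if line.strip())


def accuracy(corpus_dir):
    """Fraction of emails in !truth.txt whose prediction matches the truth.

    An email missing from !prediction.txt counts as a wrong prediction.
    """
    truth = read_labels(os.path.join(corpus_dir, "!truth.txt"))
    pred = read_labels(os.path.join(corpus_dir, "!prediction.txt"))
    if not truth:
        raise ValueError("!truth.txt contains no labelled emails")
    correct = sum(1 for name, label in truth.items() if pred.get(name) == label)
    return correct / len(truth)
```

A filter would call `write_predictions` once it has decided on every email in the folder; `accuracy` can then be run on any corpus that also contains `!truth.txt`. Other measures (for example, ones that penalise a legitimate email marked as spam more heavily) would be computed from the same two files.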
courses/be5b33prg/homeworks/spam/data.txt · Last modified: 2015/11/24 16:12 by xposik