cz.cvut.felk.newsgroup.preprocess
Class ModelBuilder

java.lang.Object
  extended by cz.cvut.felk.newsgroup.preprocess.ModelBuilder

public class ModelBuilder
extends Object

Helper class used to build a model from the set of training examples.

This class consumes files from the training set and produces a model as its output.


Field Summary
private static int LIMIT
           
private  Set<String> partialModel
          Partially constructed model.
 
Constructor Summary
ModelBuilder()
           
 
Method Summary
 Model createModel()
           
 void parseFile(String targetClass, BufferedReader fileContent)
          Parses the given file and extracts the information into the model.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LIMIT

private static final int LIMIT
See Also:
Constant Field Values

partialModel

private Set<String> partialModel
Partially constructed model.

Currently the partial model is a set of words. To improve it, you could start counting the frequency of each word.
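
The frequency-counting idea could be sketched as follows. This is a minimal illustration, not the actual ModelBuilder implementation: the class name WordFrequencySketch, the method countWords, and the tokenization rule (a word is a run of letters, compared case-insensitively) are all assumptions made for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of ModelBuilder: replaces a Set<String>
// of words with a Map<String, Integer> of word frequencies.
class WordFrequencySketch {

    // Reads the content line by line and counts how often each word occurs.
    static Map<String, Integer> countWords(BufferedReader fileContent) throws IOException {
        Map<String, Integer> frequencies = new HashMap<>();
        String line;
        while ((line = fileContent.readLine()) != null) {
            // Assumption: a word is a maximal run of letters, lower-cased.
            for (String token : line.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (!token.isEmpty()) {
                    // Insert 1 for a new word, otherwise add 1 to the existing count.
                    frequencies.merge(token, 1, Integer::sum);
                }
            }
        }
        return frequencies;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader reader =
                new BufferedReader(new StringReader("The cat saw the dog.\nThe dog ran."));
        System.out.println(countWords(reader));
    }
}
```

A map keeps the same set of words as its key set, so code that only needs word membership could keep working against the keys while the counts are used for weighting.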

Constructor Detail

ModelBuilder

public ModelBuilder()
Method Detail

parseFile

public void parseFile(String targetClass,
                      BufferedReader fileContent)
               throws IOException
Parses the given file and extracts the information into the model.

Currently the method only collects the set of words in a file, regardless of the newsgroup the file comes from. One possible improvement is to focus on words that have an "interesting" statistical distribution across the different newsgroups.
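
One way to make use of targetClass is to record, for each word, how many files of each newsgroup contain it. The sketch below is only an illustration under stated assumptions: the class NewsgroupWordStatsSketch, its methods addFile and concentration, and the concentration measure itself (the share of a word's files that belong to its most frequent newsgroup) are hypothetical and are not defined by this project.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of ModelBuilder.
class NewsgroupWordStatsSketch {

    // word -> (newsgroup -> number of files of that newsgroup containing the word)
    private final Map<String, Map<String, Integer>> fileCounts = new HashMap<>();

    // Same parameter list as parseFile, but keeps track of the newsgroup.
    void addFile(String targetClass, BufferedReader fileContent) throws IOException {
        Set<String> wordsInFile = new HashSet<>();
        String line;
        while ((line = fileContent.readLine()) != null) {
            // Assumption: a word is a maximal run of letters, lower-cased.
            for (String token : line.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (!token.isEmpty()) {
                    wordsInFile.add(token);
                }
            }
        }
        // Each file contributes at most one count per word.
        for (String word : wordsInFile) {
            fileCounts.computeIfAbsent(word, w -> new HashMap<>())
                      .merge(targetClass, 1, Integer::sum);
        }
    }

    // Assumed measure: fraction of the word's files that fall into its
    // most frequent newsgroup. Returns 0.0 for words never seen.
    double concentration(String word) {
        Map<String, Integer> perGroup = fileCounts.get(word);
        if (perGroup == null) {
            return 0.0;
        }
        int total = 0;
        int max = 0;
        for (int count : perGroup.values()) {
            total += count;
            max = Math.max(max, count);
        }
        return (double) max / total;
    }
}
```

With this measure, a word that occurs in only one newsgroup scores 1.0, while a word spread evenly over two newsgroups scores 0.5. A word seen in a single file also scores 1.0, so a minimum file count would probably be needed before treating a high score as meaningful.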

Another suggestion: you can use the natural language parser to extract the subject and verb from each sentence. See [Project Home]/lib/stanford-parser/ParserDemo.java for inspiration.

Parameters:
targetClass - the newsgroup (target class) that the file belongs to
fileContent - reader over the content of the file to parse
Throws:
IOException

createModel

public Model createModel()