mahout-user mailing list archives

From Adam Baron <adam.j.ba...@gmail.com>
Subject How to segment seq2sparse output into predefined training set and test set?
Date Fri, 04 Jan 2013 00:38:33 GMT
I went through the classify-20newsgroups.sh example and now want to use
Naïve Bayes to classify my own text corpus. The only difference is that I'd
prefer to define which documents go into the training set and the test set
rather than use the split command. My team prefers accuracy comparisons
between in-sample years and out-of-sample years over a random
selection across all years. I don't believe I should run seq2sparse
separately for each set, since I'd end up with different DFs and,
more concerning, different keys assigned to the same n-gram in
dictionary.file-0.

Is there an easy way to achieve this with pre-built Mahout functionality?
The only solution that comes to mind is to write a MapReduce program that
reads the tfidf-vectors after running seq2sparse and sorts the
vectors into separate training and test sets based on some
identifier I embed in each vector's name.
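
To make the idea concrete, here is a rough sketch of just the routing logic, written in plain Python rather than as an actual MapReduce job. It does not use any Mahout or Hadoop API. The key layout ("/label/2009_doc17", with a four-digit year after the label), the regular expression, and the function names are all assumptions made up for illustration; in a real job the same test would presumably sit in a mapper that reads the tfidf-vectors and writes each vector to one of two outputs.

```python
import re

# Hypothetical key layout (an assumption, not a Mahout convention):
# "/<label>/<year>_<doc id>", e.g. "/sports/2009_doc17".
YEAR_PATTERN = re.compile(r"/(\d{4})_")


def assign_split(key, train_years):
    """Return 'train' or 'test' for a vector key, or None if no year is found."""
    match = YEAR_PATTERN.search(key)
    if match is None:
        return None
    return "train" if int(match.group(1)) in train_years else "test"


def partition(vectors, train_years):
    """Split (key, vector) pairs into training and test lists by year.

    Keys without a recognizable year are collected separately so they
    are not silently dropped into either set.
    """
    splits = {"train": [], "test": []}
    skipped = []
    for key, vector in vectors:
        split = assign_split(key, train_years)
        if split is None:
            skipped.append(key)
        else:
            splits[split].append((key, vector))
    return splits["train"], splits["test"], skipped
```

Because the vectors are only sorted after a single seq2sparse run, both sets would share the same dictionary and DF counts, which is the point of splitting at this stage.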

Thanks,
        Adam
