mahout-user mailing list archives

From Dan Filimon <dangeorge.fili...@gmail.com>
Subject Re: How to segment seq2sparse output into predefined training set and test set?
Date Fri, 04 Jan 2013 13:20:19 GMT
I haven't actually done this myself, but take a look at the
MarkPreferencesMapper class in DatasetSplitter.java.
That class is responsible for the partitioning, so you can probably
just copy it and replace map() so that it reads the year from each
document somehow (for example, from the key) and assigns the record
to the training set or the test set accordingly.

So, while it's not exactly code-free, it's better than writing a new program. :)
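
As a rough illustration, the per-record decision inside such a map()
could look something like the sketch below. It is plain Java with no
Hadoop or Mahout dependencies, and it is not Mahout code: the class name
YearSplitSketch, the method names, and the key layout (year as the first
path component, e.g. "/2009/doc-17") are all assumptions made up for the
example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Mahout): picks a set for a document from its key.
final class YearSplitSketch {
    // Assumed key layout: the year is the first path component, e.g. "/2009/doc-17".
    private static final Pattern YEAR = Pattern.compile("^/?(\\d{4})/");

    /** Returns the year embedded in the key, or -1 if the key does not match. */
    static int yearOf(String key) {
        Matcher m = YEAR.matcher(key);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Years up to and including lastTrainingYear go to "train", later years to "test". */
    static String partitionFor(String key, int lastTrainingYear) {
        int year = yearOf(key);
        if (year < 0) {
            throw new IllegalArgumentException("No year found in key: " + key);
        }
        return year <= lastTrainingYear ? "train" : "test";
    }
}
```

A copied mapper would call something like partitionFor(key.toString(), 2010)
on each tfidf-vectors record and emit the vector under the returned label,
so both sets still share the single dictionary and DF counts produced by
one seq2sparse run.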

On Fri, Jan 4, 2013 at 2:38 AM, Adam Baron <adam.j.baron@gmail.com> wrote:
> I went through the classify-20newsgroups.sh example and now want to use
> Naïve Bayes to classify my own text corpus.  Only difference is that I'd
> prefer to define which documents are in the training set and test set
> versus using the split command.  My team prefers accuracy comparisons
> between in-sample years and out-of-sample years as opposed to a random
> selection across all years.  I don't believe I should run seq2sparse
> separately for each set, since I'd end up with different DFs and,
> more concerning, different keys assigned to the same n-gram in
> the dictionary.file-0.
>
> Is there an easy way to achieve this with pre-built Mahout functionality?
> The only solution that comes to mind is to write a MapReduce program that
> parses through the tfidf-vectors after running seq2sparse and sorts the
> vectors into the separate training set and test set based on some
> variable I put in the vector name.
>
> Thanks,
>         Adam
