mahout-user mailing list archives

From "Claudia Grieco" <gri...@crmpa.unisa.it>
Subject Re: Help with Mahout Classification
Date Fri, 14 Jan 2011 15:10:01 GMT
Thanks for your kind help.
The file had just the header in it because the program wasn't writing any vectors, but I've
found out why and solved the problem (it was unable to obtain any TermFreqVector because my
Lucene index wasn't storing term vectors :D ).
Do you think SGD will be a better choice? New documents are added to the training set very
often, and documents can belong to more than one category (e.g. "sport", "italy").
Thanks again :)
Claudia

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com]
Sent: Thursday, 13 January 2011 18:22
To: user@mahout.apache.org
Subject: Re: Help with Mahout Classification

Sorry to be slow to help.

This file is a sequence file containing an id (a long), a vector (the
document) and it appears to be uncompressed.

Presumably it has more data than just this header.  It is the set of vectors
that you want to use.
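
[Editor's sketch] To make the header quoted further down easier to read, here is a small plain-Java parser for the leading fields of a sequence file. It is only a sketch under assumptions: it assumes the layout is magic bytes "SEQ", one version byte, two length-prefixed class names, two flag bytes, and a codec class name that is present only when the compressed flag is set, and it assumes every string length fits in one byte. The class name SeqHeaderSketch, the version value 6, and the flag values in sampleHeader are hypothetical, since those bytes are not legible in the quoted dump. In practice Hadoop's own SequenceFile reader should be used instead.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop or Mahout.
class SeqHeaderSketch {
    final int version;
    final String keyClass;
    final String valueClass;
    final boolean compressed;
    final boolean blockCompressed;
    final String codec; // null when the compressed flag is false

    SeqHeaderSketch(int version, String keyClass, String valueClass,
                    boolean compressed, boolean blockCompressed, String codec) {
        this.version = version;
        this.keyClass = keyClass;
        this.valueClass = valueClass;
        this.compressed = compressed;
        this.blockCompressed = blockCompressed;
        this.codec = codec;
    }

    // Reads a string prefixed by a single length byte (assumed 0..127).
    static String readShortString(ByteBuffer in) {
        int len = in.get();
        if (len < 0) {
            throw new IllegalArgumentException("multi-byte length not handled in this sketch");
        }
        byte[] buf = new byte[len];
        in.get(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    // Writes a string with a single length byte, mirroring readShortString.
    static void putShortString(ByteBuffer out, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.put((byte) bytes.length).put(bytes);
    }

    // Parses the leading header fields under the assumed layout.
    static SeqHeaderSketch parse(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data);
        if (in.get() != 'S' || in.get() != 'E' || in.get() != 'Q') {
            throw new IllegalArgumentException("missing SEQ magic bytes");
        }
        int version = in.get();
        String key = readShortString(in);
        String value = readShortString(in);
        boolean compressed = in.get() != 0;
        boolean block = in.get() != 0;
        String codec = compressed ? readShortString(in) : null;
        return new SeqHeaderSketch(version, key, value, compressed, block, codec);
    }

    // Builds a header resembling the one quoted in the thread (demo data only;
    // the version and flag values are assumptions).
    static byte[] sampleHeader() {
        ByteBuffer out = ByteBuffer.allocate(256);
        out.put(new byte[] {'S', 'E', 'Q', 6});
        putShortString(out, "org.apache.hadoop.io.LongWritable");
        putShortString(out, "org.apache.mahout.math.VectorWritable");
        out.put((byte) 1).put((byte) 0);
        putShortString(out, "org.apache.hadoop.io.compress.DefaultCodec");
        byte[] result = new byte[out.position()];
        out.flip();
        out.get(result);
        return result;
    }

    public static void main(String[] args) {
        SeqHeaderSketch h = parse(sampleHeader());
        System.out.println(h.keyClass + " -> " + h.valueClass + ", codec=" + h.codec);
    }
}
```

Under this assumed layout, the "!" and "%" characters in the quoted dump would be the length bytes (33 and 37) of the key and value class names.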

As far as training a classifier, you have a couple of options.  The simplest
way to start is probably with the SGD classifiers since most of the other
classifiers expect data to be in a textual format.  You can read this
sequence file directly in a program and, if you have values for the target
variable, you can use the vectors you are reading to train the classifier.  The
standard TrainNewsGroups example for SGD should help there except that it
mostly consists of methods for converting documents to vectors (which you
seem to already have done).
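
[Editor's sketch] The idea of training an SGD classifier one vector at a time can be illustrated in plain Java. This is not Mahout's implementation and not the TrainNewsGroups code; it is a minimal binary logistic regression with no regularization, and the class name SgdSketch, the sparse (index, value) representation, and the learning rate are all assumptions made for illustration.

```java
// Hypothetical, self-contained illustration of online (SGD) logistic regression.
class SgdSketch {
    final double[] weights;
    final double learningRate;

    SgdSketch(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    // Probability that a sparse vector (parallel index/value arrays)
    // belongs to the positive class.
    double predict(int[] idx, double[] val) {
        double z = 0.0;
        for (int i = 0; i < idx.length; i++) {
            z += weights[idx[i]] * val[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // One gradient step on a single example; label must be 0 or 1.
    void train(int label, int[] idx, double[] val) {
        double err = label - predict(idx, val);
        for (int i = 0; i < idx.length; i++) {
            weights[idx[i]] += learningRate * err * val[i];
        }
    }

    public static void main(String[] args) {
        // Two toy "documents" with disjoint features (made-up data).
        int[][] idx = {{0, 1}, {2, 3}};
        double[][] val = {{1.0, 1.0}, {1.0, 1.0}};
        int[] labels = {1, 0};

        SgdSketch model = new SgdSketch(4, 0.5);
        for (int pass = 0; pass < 50; pass++) {
            for (int d = 0; d < labels.length; d++) {
                model.train(labels[d], idx[d], val[d]);
            }
        }
        System.out.println("p(doc 0) = " + model.predict(idx[0], val[0]));
        System.out.println("p(doc 1) = " + model.predict(idx[1], val[1]));
    }
}
```

Because each call to train uses a single example, a model of this kind can keep being updated as new labeled vectors arrive, without retraining from scratch.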

You are almost certain to have more questions before this works for you, but
can you see if this gets you a bit down the road?

On Wed, Jan 12, 2011 at 7:02 AM, Claudia Grieco <grieco@crmpa.unisa.it> wrote:

>
> SEQ_!org.apache.hadoop.io.LongWritable%org.apache.mahout.math.VectorWritable__*org.apache.hadoop.io.compress.DefaultCodec______;
> hÙ¥4iU_7ãŒ(M
>

