I noticed that the Step 1 conversion reports 0 vectors for the first text file. The dictionary,
however, is built without problems. What do you think the problem could be?
-----Original Message-----
From: Claudia Grieco [mailto:grieco@crmpa.unisa.it]
Sent: Wednesday, 12 January 2011 16:03
To: user@mahout.apache.org
Subject: Help with Mahout Classification
Hi everyone,
I’m trying to build a classifier that is trained on documents taken from a Lucene
index.
Following the wiki and the examples, I understood I need to do the following:
Step 1) Transform the documents in the Lucene index into vector format, as described in https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
Step 2) Use the transformed data to train a model
Step 3) Use the model to classify new documents
The problem is that I don’t know how to get from Step 1 to Step 2: the trainer needs formatted
files (“One doc per line, first entry on the line is the label, rest is the evidence”),
while the Driver from Step 1 creates one file containing a term dictionary and another containing
the following text:
SEQ_!org.apache.hadoop.io.LongWritable%org.apache.mahout.math.VectorWritable__*org.apache.hadoop.io.compress.DefaultCodec______;
hÙ¥4iU_7ãŒ(M
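The text above looks like the start of a binary Hadoop SequenceFile (key class LongWritable, value class VectorWritable, compressed with DefaultCodec) opened in a text editor, not a plain-text file. As a rough sketch, the header fields could be read as below. The function names are made up for illustration, and the layout is an assumption: format version 5 or later, with class names shorter than 128 bytes so that each length prefix fits in a single byte.

```python
import io


def read_hadoop_string(stream):
    """Read a length-prefixed UTF-8 string.

    Assumption: the length is below 128, so Hadoop's variable-length
    integer occupies exactly one byte. Longer strings are not handled.
    """
    length = stream.read(1)[0]
    if length > 127:
        raise ValueError("multi-byte length prefixes are not handled in this sketch")
    return stream.read(length).decode("utf-8")


def read_sequence_file_header(stream):
    """Parse the leading header fields of a SequenceFile (assumed version >= 5)."""
    if stream.read(3) != b"SEQ":
        raise ValueError("not a SequenceFile: missing 'SEQ' magic bytes")
    version = stream.read(1)[0]
    if version < 5:
        raise ValueError("this sketch only covers format version 5 or later")
    key_class = read_hadoop_string(stream)
    value_class = read_hadoop_string(stream)
    compressed = stream.read(1) != b"\x00"
    block_compressed = stream.read(1) != b"\x00"
    # The codec class name is only present when compression is enabled.
    codec = read_hadoop_string(stream) if compressed else None
    return {
        "version": version,
        "key_class": key_class,
        "value_class": value_class,
        "compressed": compressed,
        "block_compressed": block_compressed,
        "codec": codec,
    }


def build_sample_header():
    """Build header bytes resembling the file quoted in this message."""
    key = b"org.apache.hadoop.io.LongWritable"
    value = b"org.apache.mahout.math.VectorWritable"
    codec = b"org.apache.hadoop.io.compress.DefaultCodec"
    return (
        b"SEQ" + bytes([6])
        + bytes([len(key)]) + key
        + bytes([len(value)]) + value
        + b"\x01\x00"
        + bytes([len(codec)]) + codec
    )


header_info = read_sequence_file_header(io.BytesIO(build_sample_header()))
```

Under these assumptions, the "!", "%" and "*" characters in the quoted text would simply be the length bytes (33, 37 and 42) of the three class names.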
I guess there are some steps I’m missing or I’m doing something wrong.
My idea would be to read the documents in the Lucene index and use one of the fields as the
label (e.g. Category: a document with category “sport” is labeled “sport” in the training
set).
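To make the idea concrete, here is a minimal sketch of writing the "one doc per line, label first" format once the documents have been read out of the index. It does not touch Lucene at all: the documents are assumed to be available already as plain dictionaries, and the field names "category" and "content", the tab separator between label and evidence, and the lowercase whitespace tokenization are all assumptions that would need checking against the trainer actually used.

```python
def to_training_line(label, text):
    """Format one document as: label, a tab, then space-separated tokens.

    Assumption: the trainer expects a tab between the label and the evidence.
    """
    # Collapse whitespace inside the label so it stays a single entry.
    clean_label = "_".join(label.split())
    # split() with no arguments also removes newlines, keeping one doc per line.
    tokens = text.lower().split()
    return clean_label + "\t" + " ".join(tokens)


def write_training_file(docs, path, label_field="category", text_field="content"):
    """Write one line per document; docs is an iterable of dicts.

    The field names are hypothetical and stand in for whatever
    fields the index actually stores.
    """
    written = 0
    with open(path, "w", encoding="utf-8") as out:
        for doc in docs:
            label = doc.get(label_field)
            text = doc.get(text_field)
            if not label or not text:
                continue  # skip documents that lack a label or any text
            out.write(to_training_line(label, text) + "\n")
            written += 1
    return written
```

For example, a document with category "sport" and content "The match was great" would become the line "sport", a tab, then "the match was great".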
Thanks for your help
Claudia