mahout-user mailing list archives

From "Claudia Grieco" <gri...@crmpa.unisa.it>
Subject Re: Help with Mahout Classification
Date Thu, 13 Jan 2011 10:05:11 GMT
I noticed that the conversion in Step 1 prints 0 vectors for the first text file. The dictionary,
however, is built without problems, so what do you think the problem could be?


-----Original Message-----
From: Claudia Grieco [mailto:grieco@crmpa.unisa.it]
Sent: Wednesday, 12 January 2011 16:03
To: user@mahout.apache.org
Subject: Help with Mahout Classification

Hi everyone,

I’m trying to build a classifier that uses documents taken from a Lucene index as its
training input.

Following the wiki and the examples, I understood I need to do the following:

 

Step 1) Transform the documents in the Lucene index into vector format, as in https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html

Step 2) Use the transformed data to train a model

Step 3) Use the model to classify new documents

 

The problem is that I don’t know how to get from Step 1 to Step 2: the trainer needs formatted
text files (“One doc per line, first entry on the line is the label, rest is the evidence”),
while the Driver from Step 1 creates one file containing a term dictionary and another containing
the following text:

 

SEQ org.apache.hadoop.io.LongWritable org.apache.mahout.math.VectorWritable org.apache.hadoop.io.compress.DefaultCodec
(followed by binary data)

 

I guess there are some steps I’m missing or I’m doing something wrong.
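
The output pasted above starts with "SEQ" followed by Java class names, which is how a binary Hadoop SequenceFile begins, so it is not meant to be read as text. As a rough sketch of how to confirm that without Hadoop on the classpath, the following plain-Java snippet reads the magic bytes, the version byte, and the key and value class names from the first bytes of a file. The class and method names (SeqHeaderPeek, readHeader) are made up for this illustration. It assumes a format version in which class names are stored with a length prefix, and it only handles names shorter than 128 bytes, where that prefix is a single byte.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Illustrative helper (not part of Hadoop or Mahout) that peeks at a SequenceFile header. */
final class SeqHeaderPeek {

    /** Version byte plus key and value class names found in the header. */
    static final class Header {
        final int version;
        final String keyClass;
        final String valueClass;

        Header(int version, String keyClass, String valueClass) {
            this.version = version;
            this.keyClass = keyClass;
            this.valueClass = valueClass;
        }
    }

    /** Parses the start of a SequenceFile from its first bytes. */
    static Header readHeader(byte[] firstBytes) {
        ByteBuffer buf = ByteBuffer.wrap(firstBytes);
        // A SequenceFile starts with the three ASCII bytes 'S', 'E', 'Q'.
        if (buf.get() != 'S' || buf.get() != 'E' || buf.get() != 'Q') {
            throw new IllegalArgumentException("Not a SequenceFile: missing SEQ magic bytes");
        }
        int version = buf.get() & 0xFF;
        String keyClass = readShortString(buf);
        String valueClass = readShortString(buf);
        return new Header(version, keyClass, valueClass);
    }

    /** Simplification: reads a string whose length prefix fits in one non-negative byte. */
    private static String readShortString(ByteBuffer buf) {
        int length = buf.get();
        if (length < 0) {
            throw new IllegalArgumentException("Length prefix not handled by this sketch");
        }
        byte[] bytes = new byte[length];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

If the value class comes back as org.apache.mahout.math.VectorWritable, the file holds serialized vectors and has to be read with Hadoop's SequenceFile reader rather than opened in a text editor.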

My idea would be to read the documents in the Lucene index and use one of the fields as the
label (e.g. Category: a document with category “sport” is labeled “sport” in the training
set).
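
To make the idea concrete, here is a minimal sketch of turning (category, text) pairs into the "one doc per line, label first" layout the trainer asks for. It deliberately leaves out the Lucene reading part. LabeledDoc and TrainingLineFormatter are hypothetical names for this example, and both the tab between label and evidence and the lowercase, punctuation-stripping tokenization are assumptions that should be checked against what the trainer actually expects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative formatter (not a Mahout class) for label-first training lines. */
final class TrainingLineFormatter {

    /** Hypothetical stand-in for a document read from the index. */
    static final class LabeledDoc {
        final String category; // value of the field used as the label
        final String text;     // value of the field used as the evidence

        LabeledDoc(String category, String text) {
            this.category = category;
            this.text = text;
        }
    }

    /** Builds one training line: label, a tab (assumed separator), then the tokens. */
    static String toTrainingLine(LabeledDoc doc) {
        // Keep the label as a single entry by replacing inner whitespace.
        String label = doc.category.trim().replaceAll("\\s+", "_");
        // Lowercase and collapse everything except letters and digits into single
        // spaces, so that the whole document ends up on one line.
        String evidence = doc.text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{N}]+", " ")
                .trim();
        return label + "\t" + evidence;
    }

    /** Converts a batch of documents, one training line per document. */
    static List<String> toTrainingLines(List<LabeledDoc> docs) {
        List<String> lines = new ArrayList<>();
        for (LabeledDoc doc : docs) {
            lines.add(toTrainingLine(doc));
        }
        return lines;
    }
}
```

With something like this, a document whose Category field is “sport” would produce a line that starts with sport, followed by the words of its text field.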

 

Thanks for your help

Claudia

 


