mahout-user mailing list archives

From Shashikant Kore <shashik...@gmail.com>
Subject Re: Failure to run Clustering example
Date Tue, 05 May 2009 14:11:39 GMT
Here is a quick update.

I wrote a simple program to create a Lucene index from the text files and
then generate document vectors for those indexed documents. I ran
k-means after creating canopies on 100 documents and it completed fine.

Here are some of the problems.
1. As pointed out by Jeff, I need to maintain an external mapping from
document ID to vector. But this requires some glue code outside the
clustering. The MAHOUT-65 issue meant to handle that looks complex.
Instead, can I just add a label to a vector and then change the
decodeVector() and asFormatString() methods to handle the label? (A
rough sketch of what I mean follows.)
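
For illustration, something like this (LabeledVector is a hypothetical
class, not existing Mahout API; it only assumes the Vector interface and
its asFormatString() method, in org.apache.mahout.matrix as of 0.1):

// Hypothetical sketch, not existing Mahout API: carry a label alongside a
// vector so the document ID survives serialization instead of living in
// glue code outside the clustering.
import org.apache.mahout.matrix.Vector;

public class LabeledVector {

  private final String label;  // e.g. the Lucene document ID or file name
  private final Vector vector; // the sparse TF-IDF document vector

  public LabeledVector(String label, Vector vector) {
    this.label = label;
    this.vector = vector;
  }

  // A decodeVector() counterpart would split on the first '|' to recover
  // the label and then decode the remainder as the vector.
  public String asFormatString() {
    return label + '|' + vector.asFormatString();
  }

  public String getLabel() { return label; }
  public Vector getVector() { return vector; }
}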

2. Creating canopies for 1,000 documents took almost 75 minutes. Though
the total number of unique terms in the index is 50,000, each vector has
fewer than 100 unique terms (i.e., each document vector is a sparse
vector of cardinality 50,000 with under 100 non-zero elements). The
hardware is admittedly low-end: 1 GB of RAM and a 1.6 GHz dual-core
processor. Hadoop is running on a single node. The values of T1 and T2
were 80 and 55 respectively, as given in the sample program.
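
For context, here is my understanding of how T1 and T2 drive canopy
formation (a simplified in-memory sketch of the standard canopy
algorithm, not the actual Mahout Hadoop implementation; euclidean()
stands in for whatever DistanceMeasure is configured):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CanopySketch {

  // Groups points into canopies. A point within T1 of a center joins that
  // canopy (a point may belong to several canopies); a point within T2 is
  // also removed from further consideration as a center. Requires T1 > T2.
  static List<List<double[]>> createCanopies(List<double[]> points,
                                             double t1, double t2) {
    List<List<double[]>> canopies = new ArrayList<List<double[]>>();
    List<double[]> remaining = new ArrayList<double[]>(points);
    while (!remaining.isEmpty()) {
      double[] center = remaining.remove(0);   // arbitrary next center
      List<double[]> canopy = new ArrayList<double[]>();
      canopy.add(center);
      for (Iterator<double[]> it = remaining.iterator(); it.hasNext();) {
        double[] p = it.next();
        double d = euclidean(center, p);
        if (d < t1) canopy.add(p);             // loosely bound
        if (d < t2) it.remove();               // tightly bound
      }
      canopies.add(canopy);
    }
    return canopies;
  }

  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}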

I believe I am missing something obvious that would make this code run
much faster; the current performance level is not acceptable.

I looked at the SparseVector code. The map of values uses Integer as the
key type and Double as the value type. Auto-boxing may slow things down,
but the existing performance suggests something else is the culprit.
(BTW, I have tried Trove's primitive collections and found substantial
performance gains; I will run some tests and share the results.)
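
To illustrate the difference (assuming Trove 2.x, where the primitive
map class is gnu.trove.TIntDoubleHashMap):

import java.util.HashMap;
import java.util.Map;

import gnu.trove.TIntDoubleHashMap; // Trove 2.x primitive int->double map

public class BoxingSketch {
  public static void main(String[] args) {
    // What SparseVector does today: each put() allocates an Integer and a
    // Double wrapper object; each get() unboxes again.
    Map<Integer, Double> boxed = new HashMap<Integer, Double>();
    boxed.put(42, 0.173);       // autoboxes both key and value
    double a = boxed.get(42);   // unboxes on the way out

    // Trove stores the primitives directly: no wrapper objects, less GC
    // pressure, better cache behavior.
    TIntDoubleHashMap primitive = new TIntDoubleHashMap();
    primitive.put(42, 0.173);
    double b = primitive.get(42);

    System.out.println(a + " == " + b);
  }
}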

3. I will submit the index-generation code after internal approvals.
Also, the code was written quickly and needs some work to bring it to an
acceptable level of quality.

Thanks,

--shashi

On Fri, May 1, 2009 at 8:36 PM, Grant Ingersoll <gsingers@apache.org> wrote:
> That sounds reasonable.  You might also look at the (Complementary) Naive
> Bayes stuff, as it has some support for calculating the TF-IDF stuff, but it
> does it from flat files.  It's in the examples part of Mahout.
>
>
> On May 1, 2009, at 5:09 AM, Shashikant Kore wrote:
>
>> Here is my plan to create the document vectors.
>>
>> 1. Create a Lucene index for all the text files.
>> 2. Iterate over the terms in the index and assign an ID to each term.
>> 3. For each text file:
>>  3a. Get the terms of the file.
>>  3b. Get the TF-IDF score of each term from the Lucene index, and
>> store this score along with the term's ID in the document vector. The
>> document vector will be a sparse vector.
>>
>> Can this now be given as input to the clustering code?
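
(For step 3b above, a rough sketch with Lucene 2.x APIs, assuming the
"contents" field was indexed with term vectors and termIds holds the IDs
assigned in step 2; a plain double[] stands in for the sparse vector:)

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermFreqVector;

public class TfIdfSketch {

  // Builds the TF-IDF weights for one document. Assumes the "contents"
  // field was indexed with term vectors, and that termIds maps each term
  // to the ID assigned in step 2.
  static double[] tfIdfVector(IndexReader reader, int docId,
                              Map<String, Integer> termIds,
                              int cardinality) throws IOException {
    double[] vector = new double[cardinality];
    TermFreqVector tfv = reader.getTermFreqVector(docId, "contents");
    String[] terms = tfv.getTerms();
    int[] freqs = tfv.getTermFrequencies();
    int numDocs = reader.numDocs();
    for (int i = 0; i < terms.length; i++) {
      int df = reader.docFreq(new Term("contents", terms[i]));
      double idf = Math.log((double) numDocs / (df + 1)); // smoothed IDF
      vector[termIds.get(terms[i])] = freqs[i] * idf;
    }
    return vector;
  }
}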
