mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Failure to run Clustering example
Date Wed, 06 May 2009 01:15:04 GMT

On May 5, 2009, at 7:11 AM, Shashikant Kore wrote:

> Here is a quick update.
>
> I wrote a simple program to create a Lucene index from the text files and
> then generate document vectors for the indexed documents. I ran
> K-means after creating canopies on 100 documents and it returned fine.
>
> Here are some of the problems.
> 1. As pointed out by Jeff, I need to maintain an external mapping of
> document ID to vector. But this requires some glue code
> outside the clustering. The MAHOUT-65 issue that handles this looks complex.
> Instead, can I just add a label to a vector and then change the
> decodeVector() and asFormatString() methods to handle the label?
>
> 2. Creating canopies for 1000 documents took almost 75 minutes.
> Though the total number of unique terms in the index is 50,000, each
> vector has fewer than 100 unique terms (i.e., each document vector is a
> sparse vector of cardinality 50,000 with fewer than 100 non-zero
> elements). The hardware is admittedly "low-end", with 1 GB RAM and a
> 1.6 GHz dual-core processor. Hadoop has one node. The values of T1 and
> T2 were 80 and 55 respectively, as given in the sample program.

Have you profiled it?  Would be good to see where the issue is coming  
from.
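
As a baseline to compare a profile against, here is a minimal Java sketch of single-node canopy generation over sparse vectors, with a timer around it. It is not Mahout's implementation: the names CanopySketch, distance and buildCanopies are hypothetical, and it assumes Euclidean distance and vectors stored as maps from term index to weight. The distance loop touches only the non-zero entries, so its cost tracks the roughly 100 terms per document rather than the 50,000 cardinality.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch for comparison while profiling; not Mahout's API.
final class CanopySketch {

    // Euclidean distance between two sparse vectors (term index -> weight).
    // Only non-zero entries are visited.
    static double distance(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            double diff = e.getValue() - b.getOrDefault(e.getKey(), 0.0);
            sum += diff * diff;
        }
        for (Map.Entry<Integer, Double> e : b.entrySet()) {
            if (!a.containsKey(e.getKey())) {
                sum += e.getValue() * e.getValue();
            }
        }
        return Math.sqrt(sum);
    }

    // Returns canopies as lists of point indices; assumes t1 > t2.
    static List<List<Integer>> buildCanopies(
            List<Map<Integer, Double>> points, double t1, double t2) {
        List<Integer> remaining = new ArrayList<>();
        for (int i = 0; i < points.size(); i++) {
            remaining.add(i);
        }
        List<List<Integer>> canopies = new ArrayList<>();
        while (!remaining.isEmpty()) {
            int center = remaining.remove(0); // next unassigned point becomes a center
            List<Integer> canopy = new ArrayList<>();
            canopy.add(center);
            List<Integer> next = new ArrayList<>();
            for (int idx : remaining) {
                double d = distance(points.get(center), points.get(idx));
                if (d < t1) {
                    canopy.add(idx);  // loosely bound to this canopy
                }
                if (d >= t2) {
                    next.add(idx);    // not tightly bound: may still start a canopy
                }
            }
            remaining = next;
            canopies.add(canopy);
        }
        return canopies;
    }

    public static void main(String[] args) {
        // Three toy documents; T1 = 80 and T2 = 55 as in the sample program.
        List<Map<Integer, Double>> points = List.of(
                Map.of(0, 0.0), Map.of(0, 10.0), Map.of(1, 100.0));
        long start = System.nanoTime();
        List<List<Integer>> canopies = buildCanopies(points, 80.0, 55.0);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        System.out.println(canopies + " in " + elapsedMicros + " us");
    }
}
```

If a loop like this finishes quickly on the same 1000 vectors, the time is probably going somewhere other than the distance arithmetic, which is what a profile would confirm or rule out.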

