mahout-user mailing list archives

From Grant Ingersoll
Subject Re: Performance gains with changes in distance calculation
Date Fri, 22 May 2009 11:15:11 GMT

On May 22, 2009, at 6:52 AM, Shashikant Kore wrote:

> Hi,
> I am working on clustering a dataset which has thousands of sparse
> vectors. The complete dataset has a few tens of thousands of feature
> items but each vector has only a couple of hundred feature items. For
> this, there is an optimization in distance calculation, a link to
> which I found in the archives of the Mahout mailing list.
> I tried out this optimization.  The test setup had 2000 document
> vectors with a few hundred items each.  I ran canopy generation with
> Euclidean distance and t1, t2 values of 250 and 200.
> Current Canopy Generation: 28 min 15 sec.
> Canopy Generation with distance optimization: 1 min 38 sec.

Very cool.
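
The message does not say which distance optimization was used, so the following is only a sketch of one common approach for sparse vectors, not necessarily the one tested above. It assumes the identity |a - b|^2 = |a|^2 + |b|^2 - 2(a . b): the squared norms are computed once per vector, and each distance then only needs a dot product over the non-zero entries of the smaller vector. The class and method names (SparseDistanceSketch, squaredNorm, dot, distance) are hypothetical and are not Mahout APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Mahout code: Euclidean distance between sparse
// vectors stored as index -> value maps, using cached squared norms.
final class SparseDistanceSketch {

    // Sum of squares of the non-zero entries; compute once per vector and reuse.
    static double squaredNorm(Map<Integer, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return sum;
    }

    // Dot product that iterates only over the entries of the smaller map.
    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        Map<Integer, Double> small = a.size() <= b.size() ? a : b;
        Map<Integer, Double> large = (small == a) ? b : a;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    // |a - b| = sqrt(|a|^2 + |b|^2 - 2 a.b); clamp at zero to absorb rounding error.
    static double distance(Map<Integer, Double> a, double squaredNormA,
                           Map<Integer, Double> b, double squaredNormB) {
        double squared = squaredNormA + squaredNormB - 2.0 * dot(a, b);
        return Math.sqrt(Math.max(0.0, squared));
    }
}
```

With a few hundred non-zero entries per vector and tens of thousands of features overall, each distance costs work proportional to the non-zero count rather than to the full dimensionality, which is the kind of saving that could explain a large drop in canopy generation time.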

> I know from experience that using Integer and Double objects instead
> of primitives is computationally expensive. I changed the sparse vector
> implementation to use primitive collections by Trove.
> Distance optimization with Trove: 59 sec
> Current canopy generation with Trove: 21 min 55 sec
> To sum up, these two optimizations reduced cluster generation time by
> 97%.
> Currently, I have made the changes for Euclidean Distance, Canopy and
> KMeans.  How do we go about pushing these changes to Mahout?
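
To illustrate the boxing point without depending on Trove, here is a minimal sketch of a sparse vector backed by plain primitive arrays. This is a swapped-in technique, not the Trove-based change described above: it assumes indices are kept sorted in strictly increasing order and walks both vectors in a single merge pass, so no Integer or Double objects are allocated. The class name PrimitiveSparseVectorSketch and its methods are hypothetical.

```java
// Hypothetical sketch, not Mahout or Trove code: a sparse vector stored as
// parallel primitive arrays, so arithmetic never boxes ints or doubles.
final class PrimitiveSparseVectorSketch {
    private final int[] indices;    // assumed strictly increasing
    private final double[] values;  // values[k] belongs to indices[k]

    PrimitiveSparseVectorSketch(int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have equal length");
        }
        this.indices = indices.clone();
        this.values = values.clone();
    }

    // Dot product via a merge over the two sorted index arrays.
    double dot(PrimitiveSparseVectorSketch other) {
        int i = 0;
        int j = 0;
        double sum = 0.0;
        while (i < indices.length && j < other.indices.length) {
            if (indices[i] == other.indices[j]) {
                sum += values[i] * other.values[j];
                i++;
                j++;
            } else if (indices[i] < other.indices[j]) {
                i++;
            } else {
                j++;
            }
        }
        return sum;
    }

    // Squared Euclidean distance via the same merge; an index present in only
    // one vector contributes the square of that vector's value.
    double squaredDistance(PrimitiveSparseVectorSketch other) {
        int i = 0;
        int j = 0;
        double sum = 0.0;
        while (i < indices.length || j < other.indices.length) {
            double diff;
            if (j >= other.indices.length
                    || (i < indices.length && indices[i] < other.indices[j])) {
                diff = values[i++];
            } else if (i >= indices.length || other.indices[j] < indices[i]) {
                diff = other.values[j++];
            } else {
                diff = values[i++] - other.values[j++];
            }
            sum += diff * diff;
        }
        return sum;
    }
}
```

A hash-based primitive map gives faster random access and updates, while the sorted-array layout here favors read-only passes such as distance computation; which one suits canopy and k-means better would need measuring.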

It's a bit complicated by Trove, b/c that is LGPL.  What that means,  
unfortunately, is that we can't check it into our code or distribute  
it.  However, if it is in a Maven repo somewhere (I see an old  
version) then it is easier to include.  I haven't looked at the code,  
but is it possible that fills  
the same role or some other library out there that has a more friendly  
license?

Regardless of these, feel free to submit a patch, so we can at least  
look at it and have something concrete to discuss in JIRA.


Grant Ingersoll
