mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Performance gains with changes in distance calculation
Date Fri, 22 May 2009 16:03:56 GMT
Shashi,

I'm glad to see you have demonstrated the improvement made possible by 
that optimization. It is really astounding. I will look over your 
patches immediately.

Jeff

Shashikant Kore wrote:
> Hi,
>
> I am working on clustering a dataset which has thousands of sparse
> vectors. The complete dataset has a few tens of thousands of feature
> items, but each vector has only a couple of hundred feature items. For
> this, there is an optimization in distance calculation, a link to
> which I found in the archives of the Mahout mailing list.
>
> http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
>
> I tried out this optimization.  The test setup had 2000 document
> vectors with a few hundred items each.  I ran canopy generation with
> Euclidean distance and t1, t2 values of 250 and 200.
>
> Current Canopy Generation: 28 min 15 sec.
> Canopy Generation with distance optimization: 1 min 38 sec.
>
> I know from experience that using Integer and Double objects instead
> of primitives is computationally expensive. I changed the sparse
> vector implementation to use the primitive collections from Trove [
> http://trove4j.sourceforge.net/ ].
>
> Distance optimization with Trove: 59 sec
> Current canopy generation with Trove: 21 min 55 sec
>
> To sum up, these two optimizations together reduced canopy generation
> time by about 97% (from 28 min 15 sec to 59 sec).
>
> Currently, I have made the changes for Euclidean Distance, Canopy and
> KMeans.  How do we go about pushing these changes to Mahout?
>
> --shashi
>
>
>   
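
For readers of the archive, the distance optimization discussed above can be sketched as follows. This is a minimal illustration, not the code from Shashi's patches or from Mahout. It assumes the usual algebraic expansion |x - c|^2 = |x|^2 + |c|^2 - 2(x . c), with a sparse point x stored as parallel index/value arrays and a dense centroid c. The class name SparseDistance and its method names are hypothetical. The squared norms are computed once and cached, so each distance call only touches the non-zero entries of the point.

```java
public final class SparseDistance {
    private SparseDistance() {}

    /**
     * Sum of squares of the given values. For a sparse vector, passing only
     * its non-zero values gives the full squared norm, since zeros add nothing.
     * Intended to be computed once per vector and cached by the caller.
     */
    public static double squaredNorm(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v * v;
        }
        return sum;
    }

    /**
     * Euclidean distance between a sparse point and a dense centroid.
     *
     * @param indices        positions of the point's non-zero entries
     * @param values         the non-zero entries, parallel to indices
     * @param pointSqNorm    cached squaredNorm(values)
     * @param centroid       dense centroid vector
     * @param centroidSqNorm cached squaredNorm(centroid)
     */
    public static double distance(int[] indices, double[] values, double pointSqNorm,
                                  double[] centroid, double centroidSqNorm) {
        // The dot product only visits the point's non-zero entries.
        double dot = 0.0;
        for (int i = 0; i < indices.length; i++) {
            dot += values[i] * centroid[indices[i]];
        }
        // Clamp at zero to guard against small negative rounding errors.
        double squared = pointSqNorm + centroidSqNorm - 2.0 * dot;
        return Math.sqrt(Math.max(0.0, squared));
    }
}
```

The sketch uses primitive int[] and double[] arrays rather than boxed Integer and Double objects, which is in the same spirit as the switch to primitive collections described in the message, although it does not use Trove itself.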

