mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Performance gains with changes in distance calculation
Date Fri, 22 May 2009 11:15:11 GMT

On May 22, 2009, at 6:52 AM, Shashikant Kore wrote:

> Hi,
>
> I am working on clustering a dataset which has thousands of sparse
> vectors. The complete dataset has a few tens of thousands of feature
> items, but each vector has only a couple of hundred feature items. For
> this, there is an optimization in distance calculation, a link to
> which I found in the archives of the Mahout mailing list.
>
> http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
>
> I tried out this optimization.  The test setup had 2000 document
> vectors with a few hundred items.  I ran canopy generation with
> Euclidean distance and t1 and t2 values of 250 and 200.
>
> Current Canopy Generation: 28 min 15 sec.
> Canopy Generation with distance optimization: 1 min 38 sec.
>

Very cool.
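
For readers who have not followed the link, a minimal sketch of the kind of
optimization being described, assuming it is the usual algebraic expansion
||x - c||^2 = ||x||^2 - 2(x . c) + ||c||^2 with a sparse point x and a dense
centroid c. The two squared norms are computed once and cached, so each
distance only touches the point's non-zero entries. The class and method
names here are hypothetical; this is not the code from the patch or from
Mahout.

```java
final class SparseEuclidean {

    /** Squared L2 norm, given only the non-zero values of a vector. */
    static double squaredNorm(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v * v;
        }
        return sum;
    }

    /**
     * Squared Euclidean distance between a sparse point and a dense centroid.
     * indices/values hold the point's non-zero entries; pointNormSq and
     * centroidNormSq are assumed to be precomputed and cached by the caller.
     */
    static double squaredDistance(int[] indices, double[] values, double pointNormSq,
                                  double[] centroid, double centroidNormSq) {
        double dot = 0.0;
        // Only the point's non-zero entries contribute to the dot product.
        for (int k = 0; k < indices.length; k++) {
            dot += values[k] * centroid[indices[k]];
        }
        double d2 = pointNormSq - 2.0 * dot + centroidNormSq;
        // Rounding can push a tiny true distance slightly below zero.
        return Math.max(d2, 0.0);
    }
}
```

Take Math.sqrt of the result when the actual distance is needed, for
example when comparing against the t1 and t2 thresholds.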

> I know from experience that using Integer, Double objects instead of
> primitives is computationally expensive. I changed the sparse vector
> implementation to use primitive collections from Trove [
> http://trove4j.sourceforge.net/ ].
>
> Distance optimization with Trove: 59 sec
> Current canopy generation with Trove: 21 min 55 sec
>
> To sum up, these two optimizations reduced cluster generation time by
> 97%.
>
> Currently, I have made the changes for Euclidean Distance, Canopy and
> KMeans.  How do we go about pushing these changes to Mahout?
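
To illustrate the boxing point above without depending on Trove, here is a
minimal sketch of a sparse vector backed by parallel primitive arrays
instead of a Map<Integer, Double>. It is an assumption-laden illustration,
not Trove's API and not the patch under discussion; the class name is
hypothetical and the indices are assumed to be sorted with no duplicates.

```java
final class PrimitiveSparseVector {
    private final int[] indices;   // sorted ascending, no duplicates (assumed)
    private final double[] values; // values[k] belongs to indices[k]

    PrimitiveSparseVector(int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have equal length");
        }
        this.indices = indices.clone();
        this.values = values.clone();
    }

    /** Dot product via a merge over both sorted index arrays; no boxing occurs. */
    double dot(PrimitiveSparseVector other) {
        double sum = 0.0;
        int i = 0;
        int j = 0;
        while (i < indices.length && j < other.indices.length) {
            int a = indices[i];
            int b = other.indices[j];
            if (a == b) {
                sum += values[i] * other.values[j];
                i++;
                j++;
            } else if (a < b) {
                i++;
            } else {
                j++;
            }
        }
        return sum;
    }
}
```

A primitive hash map such as the ones Trove provides serves the same goal
when entries must be inserted or looked up by index at random.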


http://cwiki.apache.org/MAHOUT/howtocontribute.html

It's a bit complicated by Trove, b/c that is LGPL.  What that means,
unfortunately, is that we can't check it into our code or distribute
it.  However, if it is in a Maven repo somewhere (I see an old
version), then it is easier to include.  I haven't looked at the code,
but is it possible that http://commons.apache.org/primitives/ fills
the same role, or that some other library out there has a friendlier
license?

Regardless, feel free to submit a patch, so we can at least
look at it and have something concrete to discuss in JIRA.

Thanks,
Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

