mahout-user mailing list archives

From Anna Lahoud <>
Subject Re: Need to reduce execution time of RowSimilarityJob
Date Tue, 18 Sep 2012 13:08:51 GMT

I have spent a great deal of time working with it recently and can offer
you some quick tips.

   1. The density of the matrix is the biggest factor. Use sparse vectors
   if you can; that alone will reduce the run time.
   2. Set a larger number of reducers to decrease the processing time per
   node. I have had jobs fail when a single node could not merge the results.
   3. If you are using the output from seq2sparse, use the tfidf vectors
   as input. They can be significantly less dense, depending on the
   parameters you used to run seq2sparse.
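
To make the density point concrete, here is a rough sketch in plain Python
(not Mahout code). It assumes the cost of a row-similarity computation grows
with the number of row pairs that share a column, which is my simplification
rather than a description of the job's internals. The helper names and the
max_df_fraction cutoff are hypothetical, chosen only for illustration.

```python
from collections import Counter

def density(rows, num_cols):
    """Fraction of non-zero cells; rows are sparse {column: value} dicts."""
    nnz = sum(len(r) for r in rows)
    return nnz / (len(rows) * num_cols)

def cooccurrence_pairs(rows):
    """Number of row pairs sharing a column, summed over all columns."""
    col_counts = Counter(col for r in rows for col in r)
    return sum(n * (n - 1) // 2 for n in col_counts.values())

def prune_frequent_columns(rows, max_df_fraction):
    """Drop columns that occur in more than max_df_fraction of the rows.

    max_df_fraction is a hypothetical parameter standing in for a
    document-frequency cutoff applied when the vectors are built.
    """
    col_counts = Counter(col for r in rows for col in r)
    limit = max_df_fraction * len(rows)
    return [{c: v for c, v in r.items() if col_counts[c] <= limit}
            for r in rows]

# Toy matrix: 4 rows, 4 columns; column 0 appears in every row.
rows = [
    {0: 1.0, 1: 2.0},
    {0: 1.0, 2: 1.0},
    {0: 3.0, 1: 1.0},
    {0: 1.0, 3: 2.0},
]

pruned = prune_frequent_columns(rows, max_df_fraction=0.5)

print(density(rows, 4), cooccurrence_pairs(rows))
print(density(pruned, 4), cooccurrence_pairs(pruned))
```

In this toy case, dropping the one column that every row contains halves the
density and cuts the shared-column pairs from 7 to 1, so a few very common
terms can account for most of the pairwise work.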

Using these suggestions, we got a job that had taken many hours to run in
well under an hour on a large cluster.


On Tue, Sep 18, 2012 at 8:49 AM, yamo93 <> wrote:

> Hi,
> I have 30.000 items and the computation takes more than 2h on a
> pseudo-cluster, which is too long in my case.
> I think of some ways to reduce the execution time of RowSimilarityJob and
> I wonder if some of you have implemented them and how, or explored other
> ways.
> 1. tune the JVM
> 2. developing an in memory implementation (i.e. without hadoop)
> 3. reduce the size of the matrix (by removing those which have no words in
> common, for example)
> 4. run on real hadoop cluster with several nodes (does anyone have an idea
> of the number of nodes to make it interesting)
> Thanks for your help,
> Yann.
