mahout-user mailing list archives

From yamo93 <>
Subject Re: Need to reduce execution time of RowSimilarityJob
Date Tue, 18 Sep 2012 15:14:03 GMT
Hi Sean,

My need is to compute document similarity (30,000 docs) and, more
precisely, to find the n most similar docs.
As mentioned earlier, I use RowSimilarityJob, but it takes 2h+ to compute.

Seb suggests using an item-item recommender with (term, document,
tf-idf) as input data.
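
For reference, here is a minimal sketch of an in-memory alternative (no Hadoop) that takes the same (term, document, tf-idf) triples and returns the n most similar docs by cosine similarity. It is an illustration under stated assumptions, not what RowSimilarityJob does internally: the function name top_n_similar is hypothetical, each (term, document) pair is assumed to appear only once, and document ids are assumed to be comparable values such as strings. With 30,000 docs there are about 450 million possible pairs (30,000 x 29,999 / 2), so the sketch only scores pairs that share at least one term.

```python
import heapq
import math
from collections import defaultdict


def top_n_similar(triples, n):
    """Return {doc: [(cosine, other_doc), ...]} holding the n best matches.

    `triples` is an iterable of (term, doc, tfidf) tuples. Each (term, doc)
    pair is assumed to appear at most once.
    """
    postings = defaultdict(list)   # term -> [(doc, weight), ...]
    sq_norms = defaultdict(float)  # doc -> sum of squared weights
    for term, doc, weight in triples:
        postings[term].append((doc, weight))
        sq_norms[doc] += weight * weight

    # Accumulate dot products only for pairs sharing at least one term;
    # pairs with no words in common never appear here.
    dots = defaultdict(float)
    for entries in postings.values():
        for i, (doc_a, w_a) in enumerate(entries):
            for doc_b, w_b in entries[i + 1:]:
                key = (doc_a, doc_b) if doc_a <= doc_b else (doc_b, doc_a)
                dots[key] += w_a * w_b

    # Turn dot products into cosines and record them for both docs.
    scored = defaultdict(list)
    for (doc_a, doc_b), dot in dots.items():
        norm = math.sqrt(sq_norms[doc_a] * sq_norms[doc_b])
        if norm == 0.0:
            continue
        cosine = dot / norm
        scored[doc_a].append((cosine, doc_b))
        scored[doc_b].append((cosine, doc_a))

    return {doc: heapq.nlargest(n, pairs) for doc, pairs in scored.items()}
```

The cost is driven by terms that occur in many docs, since every pair of docs in such a term's posting list gets touched. That is why pruning such as stop-word removal matters here as well.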


On 09/18/2012 04:21 PM, Sean Owen wrote:
> If you are computing user-user similarity, the number of items is not
> nearly as important as the number of users. If you have 1M users, then
> computing about 500 billion user-user similarities is going to take a long
> time no matter what.
> CSV is the input for both Hadoop-based and non-Hadoop-based
> implementations. The Hadoop-based implementation converts to vectors. You
> can inject vectors directly if you want, there. But you need CSV for the
> non-Hadoop code.
> There are a number of tuning params in the Hadoop implementation (and
> similar but different hooks in the non-Hadoop implementation) that let you
> prune data at several stages. This is the most important thing for speed.
> Yes, removing stop-words falls in that category.
> Tuning the JVM helps, but only marginally. More Hadoop nodes help, linearly.
> On Tue, Sep 18, 2012 at 1:49 PM, yamo93 <> wrote:
>> Hi,
>> I have 30,000 items and the computation takes more than 2h on a
>> pseudo-cluster, which is too long in my case.
>> I can think of some ways to reduce the execution time of RowSimilarityJob,
>> and I wonder whether any of you have implemented them (and how) or explored
>> other ways.
>> 1. tune the JVM
>> 2. develop an in-memory implementation (i.e. without Hadoop)
>> 3. reduce the size of the matrix (by removing those which have no words in
>> common, for example)
>> 4. run on a real Hadoop cluster with several nodes (does anyone have an idea
>> of the number of nodes needed to make it worthwhile?)
>> Thanks for your help,
>> Yann.
