mahout-user mailing list archives

From: Sean Owen <sro...@gmail.com>
Subject: Re: Need to reduce execution time of RowSimilarityJob
Date: Tue, 18 Sep 2012 16:34:14 GMT
That sounds quite slow. Are you definitely computing item-item similarity?
If users are the rows, then this job is computing user-user similarity.

An item-based recommender isn't necessary per se; you just need item
similarity. ItemBasedRecommender has a convenience method to find just the
top N most similar items (see the sketch below). If your scale is such that
working in memory is feasible, that is by far the best answer.
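
Roughly, that in-memory path looks like this (untested sketch, using
(term, document, tf-idf) triples as Seb suggests below; the file name, the
doc ID, and the choice of cosine similarity are placeholders):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class MostSimilarDocs {
  public static void main(String[] args) throws Exception {
    // CSV of (termID,docID,tf-idf) triples, read as (user,item,preference).
    DataModel model = new FileDataModel(new File("term-doc-tfidf.csv"));
    GenericItemBasedRecommender recommender = new GenericItemBasedRecommender(
        model, new UncenteredCosineSimilarity(model));
    // The convenience method: the 10 items (docs) most similar to doc 123.
    List<RecommendedItem> similar = recommender.mostSimilarItems(123L, 10);
    for (RecommendedItem doc : similar) {
      System.out.println(doc.getItemID() + "\t" + doc.getValue());
    }
  }
}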

Sean

On Tue, Sep 18, 2012 at 4:14 PM, yamo93 <yamo93@gmail.com> wrote:

> Hi Sean,
>
> My need is to compute document similarity (30,000 docs) and, more
> precisely, to find the n most similar docs.
> As written above, I use RowSimilarityJob, but it takes 2h+ to compute.
>
> Seb suggested using an item-item recommender with (term, document, tf-idf)
> triples as input.
>
> Rgds,
> Y.
>
>
> On 09/18/2012 04:21 PM, Sean Owen wrote:
>
>> If you are computing user-user similarity, the number of items is not
>> nearly as important as the number of users. If you have 1M users, then
>> computing the roughly 500 billion user-user similarities (1,000,000 x
>> 999,999 / 2 pairs) is going to take a long time no matter what.
>>
>> CSV is the input for both the Hadoop-based and non-Hadoop-based
>> implementations. The Hadoop-based implementation converts it to vectors;
>> there, you can inject vectors directly if you want (see the sketch
>> below). But you need CSV for the non-Hadoop code.
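>>
>> Injecting vectors means writing the job's input yourself as a
>> SequenceFile<IntWritable,VectorWritable> of row vectors. A rough,
>> untested sketch (the path and all the numbers are placeholders):
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.IntWritable;
>> import org.apache.hadoop.io.SequenceFile;
>> import org.apache.mahout.math.RandomAccessSparseVector;
>> import org.apache.mahout.math.Vector;
>> import org.apache.mahout.math.VectorWritable;
>>
>> public class WriteRowVectors {
>>   public static void main(String[] args) throws Exception {
>>     Configuration conf = new Configuration();
>>     FileSystem fs = FileSystem.get(conf);
>>     SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
>>         new Path("rowsim/input/part-0"),
>>         IntWritable.class, VectorWritable.class);
>>     // One row per document: a sparse vector of tf-idf weights over terms.
>>     Vector row = new RandomAccessSparseVector(50000);  // vocabulary size
>>     row.setQuick(42, 0.37);  // (term index, tf-idf weight)
>>     writer.append(new IntWritable(0), new VectorWritable(row));  // doc 0
>>     writer.close();
>>   }
>> }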
>>
>> There are a number of tuning params in the Hadoop implementation (and
>> similar but different hooks in the non-Hadoop implementation) that let you
>> prune data at several stages. This is the most important thing for speed.
>> Yes, removing stop-words falls in that category.
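>>
>> For example, with the 0.7-era RowSimilarityJob flags (untested sketch;
>> check your version's options, and the threshold and top-N values here are
>> just placeholders):
>>
>> import org.apache.hadoop.util.ToolRunner;
>> import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;
>>
>> public class RunRowSimilarity {
>>   public static void main(String[] args) throws Exception {
>>     ToolRunner.run(new RowSimilarityJob(), new String[] {
>>         "--input", "rowsim/input",
>>         "--output", "rowsim/output",
>>         "--numberOfColumns", "50000",     // vocabulary size
>>         "--similarityClassname", "SIMILARITY_COSINE",
>>         "--maxSimilaritiesPerRow", "10",  // keep only the top 10 per doc
>>         "--threshold", "0.1",             // drop pairs scoring below this
>>         "--excludeSelfSimilarity", "true"
>>     });
>>   }
>> }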
>>
>> Tuning the JVM helps, but only marginally. More Hadoop nodes help, linearly.
>>
>> On Tue, Sep 18, 2012 at 1:49 PM, yamo93 <yamo93@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have 30,000 items, and the computation takes more than 2h on a
>>> pseudo-cluster, which is too long in my case.
>>>
>>> I can think of some ways to reduce the execution time of
>>> RowSimilarityJob, and I wonder if any of you have implemented them (and
>>> how), or explored other ways:
>>> 1. tune the JVM
>>> 2. develop an in-memory implementation (i.e. without Hadoop)
>>> 3. reduce the size of the matrix (by removing rows which have no words
>>> in common, for example)
>>> 4. run on a real Hadoop cluster with several nodes (does anyone have an
>>> idea of how many nodes it takes to be worthwhile?)
>>>
>>> Thanks for your help,
>>> Yann.
>>>
>
