mahout-user mailing list archives

From Anna Lahoud <annalah...@gmail.com>
Subject Re: Need to reduce execution time of RowSimilarityJob
Date Tue, 18 Sep 2012 18:13:31 GMT
It can take a while to tune the parameters, but you should definitely be
able to run the RowSimilarityJob in minutes on a set around that size.
Using multiple reducers is really important, as is the density of the
input matrix. Another useful tuning step for us was adjusting the
RowSimilarityJob configuration settings, such as capping the maximum
similarity outputs per row.
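
For anyone tuning the same thing, a minimal driver sketch of those two
knobs (the paths, reducer count, and cutoff below are made-up
placeholders, not values from this thread):

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;

    public class TunedRowSimilarity {
      public static void main(String[] args) throws Exception {
        ToolRunner.run(new RowSimilarityJob(), new String[] {
            "-Dmapred.reduce.tasks=8",            // use multiple reducers
            "--input", "/docs/matrix",            // placeholder input path
            "--output", "/docs/similarity",       // placeholder output path
            "--similarityClassname", "SIMILARITY_COSINE",
            "--maxSimilaritiesPerRow", "50",      // cap outputs per row
            "--excludeSelfSimilarity", "true"
        });
      }
    }

ToolRunner's generic -D option puts mapred.reduce.tasks into the job
configuration, so each MapReduce pass of the job should pick it up.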

On Tue, Sep 18, 2012 at 12:34 PM, Sean Owen <srowen@gmail.com> wrote:

> That sounds quite slow. You're definitely computing item-item similarity?
> If users are rows, then this job is computing user-user similarity.
>
> An item-based recommender isn't necessary per se, just item similarity. The
> ItemBasedRecommender has a convenience method to just find the top N most
> similar items. If your scale is such that working in memory is feasible,
> that is by far the best answer.
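>
> A minimal in-memory sketch of that convenience method (the CSV file
> name, the cosine similarity choice, and the IDs are illustrative
> assumptions, not anything prescribed by Mahout):
>
>     import java.io.File;
>     import java.util.List;
>     import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
>     import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
>     import org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity;
>     import org.apache.mahout.cf.taste.model.DataModel;
>     import org.apache.mahout.cf.taste.recommender.RecommendedItem;
>
>     public class MostSimilarDocs {
>       public static void main(String[] args) throws Exception {
>         // CSV rows of termID,docID,weight: terms act as "users" and
>         // docs as "items", so item-item similarity is doc-doc similarity
>         DataModel model = new FileDataModel(new File("term-doc-tfidf.csv"));
>         GenericItemBasedRecommender recommender =
>             new GenericItemBasedRecommender(model,
>                 new UncenteredCosineSimilarity(model));
>         // the convenience method: top 10 docs most similar to doc 123
>         List<RecommendedItem> similar = recommender.mostSimilarItems(123L, 10);
>         for (RecommendedItem doc : similar) {
>           System.out.println(doc.getItemID() + "\t" + doc.getValue());
>         }
>       }
>     }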
>
> Sean
>
> On Tue, Sep 18, 2012 at 4:14 PM, yamo93 <yamo93@gmail.com> wrote:
>
> > Hi Sean,
> >
> > My need is to compute document similarity (30,000 docs) and, more
> > precisely, to find the n most similar docs.
> > As written above, I use RowSimilarityJob, but it takes 2h+ to compute.
> >
> > Seb suggested using an item-item recommender with input data (term,
> > document, tf-idf).
> >
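> > Concretely, that input would be one (term, document, tf-idf) triple
> > per CSV line, with terms in the "user" role and documents in the
> > "item" role, so that item-item similarity becomes doc-doc similarity.
> > A few made-up sample rows:
> >
> >     17,123,0.83
> >     17,456,0.12
> >     42,123,0.40
> >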
> > Rgds,
> > Y.
> >
> >
> > On 09/18/2012 04:21 PM, Sean Owen wrote:
> >
> >> If you are computing user-user similarity, the number of items is not
> >> nearly as important as the number of users. If you have 1M users, then
> >> computing about 500 billion user-user similarities (n(n-1)/2 pairs for
> >> n = 10^6) is going to take a long time no matter what.
> >>
> >> CSV is the input for both the Hadoop-based and non-Hadoop-based
> >> implementations. The Hadoop-based implementation converts it to
> >> vectors; you can inject vectors directly there, if you want. But you
> >> need CSV for the non-Hadoop code.
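> >>
> >> A sketch of injecting vectors directly (the sequence-file path and
> >> the values are mine): the Hadoop job reads a SequenceFile of
> >> IntWritable row IDs mapped to VectorWritable rows.
> >>
> >>     import org.apache.hadoop.conf.Configuration;
> >>     import org.apache.hadoop.fs.FileSystem;
> >>     import org.apache.hadoop.fs.Path;
> >>     import org.apache.hadoop.io.IntWritable;
> >>     import org.apache.hadoop.io.SequenceFile;
> >>     import org.apache.mahout.math.RandomAccessSparseVector;
> >>     import org.apache.mahout.math.Vector;
> >>     import org.apache.mahout.math.VectorWritable;
> >>
> >>     public class WriteRows {
> >>       public static void main(String[] args) throws Exception {
> >>         Configuration conf = new Configuration();
> >>         FileSystem fs = FileSystem.get(conf);
> >>         SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
> >>             new Path("/docs/matrix/part-r-00000"),
> >>             IntWritable.class, VectorWritable.class);
> >>         try {
> >>           Vector row = new RandomAccessSparseVector(100000); // vocab size
> >>           row.set(17, 0.83);   // term index 17, made-up tf-idf weight
> >>           row.set(42, 0.40);
> >>           // key = document (row) id, value = its sparse term vector
> >>           writer.append(new IntWritable(123), new VectorWritable(row));
> >>         } finally {
> >>           writer.close();
> >>         }
> >>       }
> >>     }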
> >>
> >> There are a number of tuning params in the Hadoop implementation (and
> >> similar but different hooks in the non-Hadoop implementation) that let
> >> you prune data at several stages. This is the most important thing for
> >> speed. Yes, removing stop-words falls in that category.
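> >>
> >> For instance, a plain-Java sketch of the stop-word pruning idea (the
> >> word list and tokenization are stand-ins; in practice this happens
> >> during vectorization, before the similarity job ever runs):
> >>
> >>     import java.util.ArrayList;
> >>     import java.util.Arrays;
> >>     import java.util.HashSet;
> >>     import java.util.List;
> >>     import java.util.Set;
> >>
> >>     public class StopWordFilter {
> >>       private static final Set<String> STOP_WORDS =
> >>           new HashSet<String>(Arrays.asList("the", "a", "of", "and", "to"));
> >>
> >>       // Dropping ubiquitous terms shrinks every document vector,
> >>       // which cuts the work done for every row pair downstream.
> >>       public static List<String> prune(List<String> tokens) {
> >>         List<String> kept = new ArrayList<String>();
> >>         for (String token : tokens) {
> >>           if (!STOP_WORDS.contains(token.toLowerCase())) {
> >>             kept.add(token);
> >>           }
> >>         }
> >>         return kept;
> >>       }
> >>
> >>       public static void main(String[] args) {
> >>         System.out.println(
> >>             prune(Arrays.asList("The", "speed", "of", "the", "job")));
> >>         // prints [speed, job]
> >>       }
> >>     }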
> >>
> >> Tuning the JVM helps, but only marginally. Adding Hadoop nodes helps
> >> linearly.
> >>
> >> On Tue, Sep 18, 2012 at 1:49 PM, yamo93 <yamo93@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> I have 30,000 items and the computation takes more than 2h on a
> >>> pseudo-cluster, which is too long in my case.
> >>>
> >>> I am thinking of some ways to reduce the execution time of
> >>> RowSimilarityJob, and I wonder if some of you have implemented them,
> >>> and how, or explored other ways:
> >>> 1. tune the JVM
> >>> 2. develop an in-memory implementation (i.e. without Hadoop)
> >>> 3. reduce the size of the matrix (by removing rows that have no words
> >>> in common, for example)
> >>> 4. run on a real Hadoop cluster with several nodes (does anyone have
> >>> an idea of how many nodes it takes to become worthwhile?)
> >>>
> >>> Thanks for your help,
> >>> Yann.
> >>>
> >>>
> >
>
