mahout-user mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: Need to reduce execution time of RowSimilarityJob
Date Tue, 18 Sep 2012 13:21:40 GMT
Oh, I overlooked that, sorry. You could give it (document, term, tfidf)
pairs instead. If you find it awkward to use a recommender to compute
document similarities, then maybe it would be better to think about a
custom in-memory implementation.
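For 30,000 documents, a custom in-memory implementation is quite feasible: the core is just cosine similarity over the sparse tf-idf vectors that seq2sparse produces. Here is a minimal self-contained sketch in plain Java; the class and method names are illustrative, not part of the Mahout API, and a sparse vector is modeled simply as a termId-to-weight map:

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal in-memory cosine similarity over sparse tf-idf document vectors. */
public class SparseCosine {

    /** Cosine similarity between two sparse vectors (termId -> tf-idf weight). */
    public static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        // Iterate over the smaller vector and probe the larger one,
        // so documents with no terms in common cost almost nothing.
        Map<Integer, Double> small = a.size() <= b.size() ? a : b;
        Map<Integer, Double> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            Double w = large.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        if (dot == 0.0) {
            return 0.0; // no shared terms: similarity is zero, skip the norms
        }
        return dot / (norm(a) * norm(b));
    }

    private static double norm(Map<Integer, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }
}
```

Computing all pairwise similarities this way is O(n^2) in the number of documents, but with only the shared-term probing above and 30,000 documents it should finish in minutes on a single machine rather than hours on a pseudo-cluster.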


On 18.09.2012 15:13, yamo93 wrote:
> Thanks,
> 
> I need some explanations:
> GenericItemBasedRecommender needs a FileDataModel with userId, itemId,
> score.
> But I have some text documents, and today I use seq2sparse followed by
> rowid + rowsimilarity.
> How can I call GenericItemBasedRecommender with sparse vectors?
> 
> Y.
> 
> On 09/18/2012 02:57 PM, Sebastian Schelter wrote:
>> You don't need to develop an in-memory implementation, we already have
>> that.
>>
>> Simply use a GenericItemBasedRecommender and ask it for the most similar
>> items of each item.
>>
>>
>> On 18.09.2012 14:49, yamo93 wrote:
>>> Hi,
>>>
>>> I have 30,000 items and the computation takes more than 2 hours on a
>>> pseudo-cluster, which is too long in my case.
>>>
>>> I can think of several ways to reduce the execution time of
>>> RowSimilarityJob, and I wonder whether any of you have implemented
>>> them, and how, or explored other ways:
>>> 1. tuning the JVM
>>> 2. developing an in-memory implementation (i.e. without Hadoop)
>>> 3. reducing the size of the matrix (for example, by removing rows that
>>> have no words in common)
>>> 4. running on a real Hadoop cluster with several nodes (does anyone
>>> have an idea of how many nodes it takes to make this worthwhile?)
>>>
>>> Thanks for your help,
>>> Yann.
> 

