mahout-user mailing list archives

From Najum Ali <naju...@googlemail.com>
Subject Fwd: Performance Issue using item-based approach!
Date Sat, 03 May 2014 15:27:12 GMT
(Resending mail without sending my digital signature)

Hi there, 

I mentioned a problem when using the ItemBasedRecommender: it is much slower than using the UserBasedRecommender.


@Sebastian: You said that limiting the precomputation file should work, for example keeping only 50 similarities
per item, but also that this feature is not included in the precomputation yet.
However, looking at MultithreadedBatchItemSimilarities (Mahout 0.9), I saw that the constructor
accepts the following arguments:

/**
 * @param recommender recommender to use
 * @param similarItemsPerItem number of similar items to compute per item
 */
public MultithreadedBatchItemSimilarities(ItemBasedRecommender recommender, int similarItemsPerItem) {
  this(recommender, similarItemsPerItem, DEFAULT_BATCH_SIZE);
}
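
For reference, this is roughly how I run the precomputation (a simplified sketch of what my attached test file does; the file names, the class name and the choice of LogLikelihoodSimilarity are just my setup, and the imports and the computeItemSimilarities arguments are written from memory):

import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.precompute.FileSimilarItemsWriter;
import org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
import org.apache.mahout.cf.taste.similarity.precompute.BatchItemSimilarities;

public class PrecomputeTopKSimilarities {

  public static void main(String[] args) throws Exception {
    // ratings.csv: userID,itemID,rating (converted from the MovieLens 1M ratings.dat)
    DataModel model = new FileDataModel(new File("ratings.csv"));
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
    ItemBasedRecommender recommender = new GenericItemBasedRecommender(model, similarity);

    // second argument = similarItemsPerItem, so only the top 15 similar items
    // per item should end up in the output file
    BatchItemSimilarities batch = new MultithreadedBatchItemSimilarities(recommender, 15);

    // degreeOfParallelism, maxDurationInHours, writer for the resulting CSV
    int numSimilarities = batch.computeItemSimilarities(
        Runtime.getRuntime().availableProcessors(), 1,
        new FileSimilarItemsWriter(new File("similarities.csv")));

    System.out.println("wrote " + numSimilarities + " item-item similarities");
  }
}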

And in fact, if I set similarItemsPerItem to 15, the CSV file contains only 15 similar items
per item. Why did you say that this feature is not implemented yet? Maybe you meant something
else and I misunderstood, so I am a bit confused. The problem is that even with a limited
number of similar-item pairs, the user-based approach is still much faster:

Using a file with 6040 users and 3706 items:
A user-based recommender with k = 50 takes 331 ms.

The item-based recommender takes 1510 ms, and with precomputed similarities it still takes 836 ms .. more
than twice as slow. Is there no way to restrict something like the „neighborhood size“ of the user-based
approach for the item-based recommender? I have also tried SamplingCandidateItemsStrategy with e.g. 10 for
each of the first three arguments .. and also tried wrapping the similarity in a CachingItemSimilarity decorator, but nothing seems to help.

Please find attached a Java file for this test.
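
In case the attachment gets stripped on the list, this is roughly what that file does (a stripped-down sketch; the variable names, file paths and the test user ID are mine, and the imports are written from memory):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.CachingItemSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserBasedVsItemBasedTest {

  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // user-based: neighborhood restricted to the 50 nearest users
    UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(50, userSimilarity, model);
    Recommender userBased = new GenericUserBasedRecommender(model, neighborhood, userSimilarity);

    // item-based: precomputed top-15 similarities loaded from file, wrapped in a cache
    // (I also tried passing a SamplingCandidateItemsStrategy here, see the attached file)
    ItemSimilarity itemSimilarity =
        new CachingItemSimilarity(new FileItemSimilarity(new File("similarities.csv")), model);
    Recommender itemBased = new GenericItemBasedRecommender(model, itemSimilarity);

    time("user-based", userBased);
    time("item-based", itemBased);
  }

  private static void time(String label, Recommender recommender) throws Exception {
    long start = System.currentTimeMillis();
    List<RecommendedItem> recommendations = recommender.recommend(42L, 10); // 10 items for user 42
    System.out.println(label + ": " + (System.currentTimeMillis() - start) + " ms, " + recommendations);
  }
}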

And yes, I am using the GroupLens MovieLens 1M dataset.
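
(The ratings.dat in that package is "::"-separated, so I convert it once into a plain CSV that FileDataModel can read, roughly like this; the file names are again just my local setup:)

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;

public class ConvertMovieLensRatings {

  public static void main(String[] args) throws Exception {
    // MovieLens 1M ships ratings.dat as userID::itemID::rating::timestamp;
    // FileDataModel wants comma- or tab-separated fields, so convert once:
    try (BufferedReader in = new BufferedReader(new FileReader("ratings.dat"));
         PrintWriter out = new PrintWriter(new FileWriter("ratings.csv"))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] fields = line.split("::");
        out.println(fields[0] + ',' + fields[1] + ',' + fields[2]); // timestamp dropped
      }
    }
  }
}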

Could the dataset be the cause, as Sebastian mentioned before:

>>>>>> In the movielens dataset this is true for almost all pairs of items,
>>>>>> unfortunately. From 3076 items, more than 11 million similarities are
>>>>>> created. A common approach for that (which is not yet implemented in our
>>>>>> precomputation unfortunately) is to only retain the top-k similar items

I hope to get some help from you guys .. this is getting very depressing :(

Regards
Najum Ali

