mahout-user mailing list archives

From: Pat Ferrel <pat.fer...@gmail.com>
Subject: Re: Performance Issue using item-based approach!
Date: Fri, 09 May 2014 15:36:17 GMT
Can we step back a bit: is query speed the only issue? Why do you care how long it takes?
This is example data, not yours. Some of the techniques you mention below are Hadoop mapreduce
based approaches, which are by their nature batch oriented. The mapreduce item-based recommender
may take hours to complete, but it calculates all recs for all users. The results are then expected
to be put into some fast serving component like a database, so the lookup from the database
is just a column of item ids associated with a user. Very simple and super fast. This
will scale to any size your DB can support, but it requires Hadoop to be installed. If you want
fast queries you can’t get faster than precomputing them.
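
For illustration, kicking off that batch job programmatically looks roughly like this (a rough sketch only; the HDFS paths are placeholders and exact option names can vary between Mahout versions):

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class BatchRecommendations {
  public static void main(String[] args) throws Exception {
    // Compute recommendations for all users in one batch MapReduce pass.
    // The output directory is then bulk-loaded into a fast serving store such as a DB.
    ToolRunner.run(new RecommenderJob(), new String[] {
        "--input", "/data/ratings",                        // userID,itemID,pref triples on HDFS
        "--output", "/data/recommendations",               // one list of recommended item IDs per user
        "--numRecommendations", "10",
        "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD"
    });
  }
}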

If you want to use the in-memory recommender because it allows you to ask for a specific user’s
recommendations (bypassing the need for a DB), it will not scale as far. See if giving it more
memory helps. Why do you need to use the item-based approach? It is not necessarily any better. Both
still recommend items; use whichever is fastest.
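
To make that concrete, the in-memory user-based wiring is roughly the following (a minimal sketch; the file name, similarity metric, and user ID are just placeholders):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class InMemoryUserBased {
  public static void main(String[] args) throws Exception {
    // The whole data model lives in memory, so it has to fit in the heap.
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new LogLikelihoodSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(50, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    List<RecommendedItem> recs = recommender.recommend(123L, 10);  // 10 recs for user 123, on demand
    System.out.println(recs);
  }
}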

Also remember that your data may have completely different characteristics. Movielens is fine
for experimentation, but what is your data like? Will it even fit inside an in-memory recommender?
How many users, items, and interactions are there (the matrix is usually very sparse)? The larger
the data, the more likely the in-memory version won’t work.

Ted’s suggestion of using ItemSimilarityJob to create an indicator matrix, then indexing
it with Solr and querying a user’s preferences against the indicators, will produce
recs in a few milliseconds. This also scales with Solr, which is known to scale quite well.
It requires Hadoop and Solr but not necessarily a DB (though one would be nice).
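
To sketch just the query side: assuming the indicators have been indexed into a Solr field called "indicators" (the field name, item IDs, and URL below are placeholders, and this assumes SolrJ 4.x), the lookup is roughly:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class IndicatorQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");
    // The "query document" is simply the user's recent item IDs,
    // matched against the indicator field of every indexed item.
    SolrQuery query = new SolrQuery("indicators:(1208 2571 260)");
    query.setRows(10);
    QueryResponse response = solr.query(query);
    for (SolrDocument doc : response.getResults()) {
      System.out.println(doc.getFieldValue("id"));
    }
  }
}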

For experimentation, when you are not using your own data, I’m not sure I see a problem here.
There are many ways to make it faster, but they may require using a different approach. Let
your data determine this, not example data.

On May 3, 2014, at 8:27 AM, Najum Ali <najum89@googlemail.com> wrote:

(Resending this mail without my digital signature)

Hi there, 

I mentioned a problem with using the ItemBasedRecommender: it is much slower than the UserBasedRecommender.


@Sebastian: You said limiting the precomputation file should work, for example to only 50 similarities
per item. You also said this feature is not included in the precomputation yet.
However, when using MultithreadedBatchItemSimilarities (Mahout 0.9), I saw that the constructor
accepts the following arguments:

/**
 * @param recommender recommender to use
 * @param similarItemsPerItem number of similar items to compute per item
 */
public MultithreadedBatchItemSimilarities(ItemBasedRecommender recommender, int similarItemsPerItem) {
  this(recommender, similarItemsPerItem, DEFAULT_BATCH_SIZE);
}

And in fact, if I set 15 as similarItemsPerItem, the CSV file contains only 15 similar items
per item. Why did you say that this feature is not implemented yet? Maybe you meant something
else and I misunderstood, so I am a bit confused. The problem is that, even with a limited
number of similar-item pairs, the user-based approach is much faster:

Using a file with 6040 users and 3706 items:
A UserBasedRecommender with k = 50 takes 331 ms.

Item-based takes 1510 ms, and with precomputed similarities it takes 836 ms .. still about twice as
slow. Is there no possibility to restrict something like the „neighborhood size“ used in the user-based approach?
I have also tried SamplingCandidateItemsStrategy with e.g. 10 for each of the first three arguments,
and also tried using the CachingSimilarity decorator, but nothing seems to help.
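
For context, the precomputed similarities are plugged in roughly like this (a minimal sketch; the file names are placeholders, and it assumes the CSV is in the itemID1,itemID2,value format that FileItemSimilarity expects):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class PrecomputedItemBased {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // Similarities come from the precomputed CSV, so nothing is recomputed at query time.
    ItemSimilarity similarity = new FileItemSimilarity(new File("item-similarities.csv"));
    ItemBasedRecommender recommender = new GenericItemBasedRecommender(model, similarity);
    List<RecommendedItem> recs = recommender.recommend(123L, 10);
    System.out.println(recs);
  }
}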

Please find attached a Java file for this test.

And yes, I am using the GroupLens MovieLens 1M data.

Could the dataset be at fault, as Sebastian mentioned before:

>>>>>> In the movielens dataset this is true for almost all pairs of items,
>>>>>> unfortunately. From 3076 items, more than 11 million similarities are
>>>>>> created. A common approach for that (which is not yet implemented in our
>>>>>> precomputation unfortunately) is to only retain the top-k similar items
I hope to get some help from you guys .. this is getting very depressing :(

Regards
Najum Ali


