mahout-user mailing list archives

From Pat Ferrel <pat.fer...@gmail.com>
Subject Re: Performance issues in Mahout recommendations
Date Fri, 06 Jun 2014 13:40:09 GMT
In the original case you were using the Hadoop command line tool, which produces recs for
all users, not just one. Since the recs are ALL precalculated, they only need to be stored and
retrieved—very fast. Put them in a DB and, when the user visits, show the precalculated recs,
which is as fast as a single DB fetch.
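For example, loading the batch job's text output into a lookup table might look like the
sketch below (pure Python for illustration; it assumes the part files use the
"userID\t[item:score,item:score,...]" line format that recommenditembased writes—check
your actual output before relying on that):

```python
def load_recs(lines):
    """Parse recommenditembased text output into {user: [(item, score), ...]}.

    Once loaded (or bulk-inserted into a DB keyed by user), serving a
    visitor's recs is a single lookup.
    """
    recs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user_part, list_part = line.split("\t", 1)
        inner = list_part.strip().strip("[]")
        pairs = inner.split(",") if inner else []
        recs[int(user_part)] = [
            (int(item), float(score))
            for item, score in (p.split(":") for p in pairs)
        ]
    return recs

table = load_recs(["3\t[103:24.5,102:18.6,101:12.0]"])
print(table[3][0])  # highest-scoring item first
```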

Sebastian is talking about the in-memory recommender, for one machine and medium-sized
datasets. It will produce recommendations for a specific user very fast, as long as the data
is not too big—past that point the performance drops off.
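To make the item-based approach concrete, here is a toy version of what an in-memory
item-based recommender with log-likelihood similarity does. This is a pure-Python sketch of
the math only—the real thing is Mahout's GenericItemBasedRecommender with
LogLikelihoodSimilarity in Java, which does this with optimized data structures:

```python
import math
from collections import defaultdict

def xlogx(x):
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    # Dunning's log-likelihood ratio over the 2x2 co-occurrence counts
    # (users with both items, only item A, only item B, neither).
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

def recommend(prefs, user, n=3):
    """prefs: {user: set(item_ids)} implicit ratings; returns top-n new items."""
    num_users = len(prefs)
    item_users = defaultdict(set)
    for u, items in prefs.items():
        for i in items:
            item_users[i].add(u)
    seen = prefs[user]
    scores = defaultdict(float)
    for i in seen:
        for j in item_users:
            if j in seen:
                continue
            k11 = len(item_users[i] & item_users[j])
            k12 = len(item_users[i]) - k11
            k21 = len(item_users[j]) - k11
            k22 = num_users - k11 - k12 - k21
            # map LLR into [0, 1), as Mahout's similarity does
            scores[j] += 1.0 - 1.0 / (1.0 + llr(k11, k12, k21, k22))
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

With independent items the LLR is ~0, so only genuinely co-occurring items contribute to
the score—that is why log-likelihood works well on implicit data like yours.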

The third way to do this is to take the core data structure created by ItemSimilarityJob,
translate the Mahout IDs back into your item IDs, and index it with Solr. Then you can use
a user’s history as a query to Solr in realtime, which will return an ordered list of recs.
This scales indefinitely as Solr scales and is very fast. It is also nice because you can
bias results toward metadata like category, genre, or catalog section in the query itself—no
new model creation required. You’ll find a tool to help with this in mahout/examples or here:
https://github.com/pferrel/solr-recommender
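The query side is simple: the user's recent item history becomes an OR query against the
field holding each item's similar-items list. A stdlib-only sketch of building that request
(the field name "indicators" and the /select handler path are assumptions here—see the
solr-recommender README for the actual schema):

```python
from urllib.parse import urlencode

def solr_rec_query(history_item_ids, field="indicators", rows=10):
    """Build a Solr select URL that scores indexed items against a user's history.

    Items whose similar-items field matches more of the history rank higher,
    so the response ordering is itself the rec list.
    """
    q = "%s:(%s)" % (field, " ".join(str(i) for i in history_item_ids))
    return "/solr/collection1/select?" + urlencode(
        {"q": q, "rows": rows, "wt": "json"}
    )

print(solr_rec_query(["ipad", "iphone"], rows=3))
```

Biasing by metadata is then just adding a filter or boost clause to the same query, which
is why no new model build is needed per business rule.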

One of those should fit; they are all fast in the right environment. They all do require some
background, non-realtime model calculation, but that is only done periodically.


On Jun 6, 2014, at 5:33 AM, Sebastian Schelter <ssc@apache.org> wrote:

Mahout has single machine and distributed recommenders.


On 06/06/2014 02:31 PM, Warunika Ranaweera wrote:
> I agree with your suggestion though. I have already implemented a Java
> recommender and it performed better. But, due to scalability problems that
> are predicted to occur in the future, we thought of moving to Mahout.
> However, it seems like, for now, it's better to go with the single machine
> implementation.
> 
> Thanks for your suggestions,
> Warunika
> 
> 
> 
> On Fri, Jun 6, 2014 at 3:36 PM, Sebastian Schelter <ssc@apache.org> wrote:
> 
>> 1M ratings take up something like 20 megabytes. This is a data size where
>> it does not make any sense to use Hadoop. Just try the single machine
>> implementation.
>> 
>> --sebastian
>> 
>> 
>> 
>> 
>> On 06/06/2014 12:01 PM, Warunika Ranaweera wrote:
>> 
>>> Hi Sebastian,
>>> 
>>> Thanks for your prompt response. It's just a sample data set from our
>>> database and it may expand up to 6 million ratings. Since the performance
>>> was low for a smaller data set, I thought it would be even worse for a
>>> larger data set. As per your suggestion, I also applied the same command
>>> on
>>> 1 million user ratings for approx. 6000 users and got the same performance
>>> level.
>>> 
>>> What is the average running time for the Mahout distributed recommendation
>>> job on 1 million ratings? Does it usually take more than 1 minute?
>>> 
>>> Thanks in advance,
>>> Warunika
>>> 
>>> 
>>> On Fri, Jun 6, 2014 at 2:42 PM, Sebastian Schelter <ssc@apache.org>
>>> wrote:
>>> 
>>>  You should not use Hadoop for such a tiny dataset. Use the
>>>> GenericItemBasedRecommender on a single machine in Java.
>>>> 
>>>> --sebastian
>>>> 
>>>> 
>>>> On 06/06/2014 11:10 AM, Warunika Ranaweera wrote:
>>>> 
>>>>  Hi,
>>>>> 
>>>>> I am using Mahout's recommenditembased algorithm on a data set with
>>>>> nearly
>>>>> 10,000 (implicit) user ratings. This is the command I used:
>>>>> *mahout recommenditembased --input ratings.csv --output recommendation
>>>>> 
>>>>> --usersFile users.dat --tempDir temp --similarityClassname
>>>>> SIMILARITY_LOGLIKELIHOOD --numRecommendations 3 *
>>>>> 
>>>>> 
>>>>> Although the output is successfully generated, this process takes
>>>>> nearly 7
>>>>> minutes to produce recommendations for a single user. The Hadoop cluster
>>>>> has 8 nodes and the machine on which Mahout is invoked is an AWS EC2
>>>>> c3.2xlarge server. When I tracked the mapreduce jobs, I noticed that
>>>>> only one machine is utilized at a time, and the
>>>>> *recommenditembased*
>>>>> command takes 9 mapreduce jobs altogether with approx. 45 seconds taken
>>>>> per
>>>>> job.
>>>>> 
>>>>> Since the performance is too slow for real time recommendations, it
>>>>> would
>>>>> be really helpful to know whether I'm missing out on any additional
>>>>> commands
>>>>> or configurations that enable faster performance.
>>>>> 
>>>>> Thanks,
>>>>> Warunika
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 


