mahout-user mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: Performance issues in Mahout recommendations
Date Fri, 06 Jun 2014 12:33:48 GMT
Mahout has both single-machine and distributed recommenders.
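
To illustrate what "single machine" means at this scale, here is a minimal sketch in plain Java of an in-memory recommender for implicit feedback. It is not Mahout's GenericItemBasedRecommender and does not use the Mahout API: the class InMemoryItemRecommender and its methods are hypothetical names, and it scores items by raw co-occurrence counts rather than the log-likelihood similarity named in the command quoted below. The point is only that the whole data set fits in a HashMap and a recommendation is a couple of loops, with no job startup cost.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of Mahout.
class InMemoryItemRecommender {
    // userId -> items the user has interacted with (implicit feedback, no rating value)
    private final Map<Long, Set<Long>> itemsByUser = new HashMap<>();

    void addInteraction(long userId, long itemId) {
        itemsByUser.computeIfAbsent(userId, k -> new HashSet<>()).add(itemId);
    }

    // Scores each unseen item j as the sum, over the user's items i, of the
    // number of users who have both i and j (plain co-occurrence counting).
    List<Long> recommend(long userId, int howMany) {
        Set<Long> seen = itemsByUser.getOrDefault(userId, Collections.emptySet());
        Map<Long, Integer> scores = new HashMap<>();
        for (Set<Long> basket : itemsByUser.values()) {
            // How many of the target user's items appear in this basket.
            int shared = 0;
            for (Long item : basket) {
                if (seen.contains(item)) shared++;
            }
            if (shared == 0) continue;
            // Every unseen item in the basket co-occurs once with each shared item.
            for (Long item : basket) {
                if (!seen.contains(item)) scores.merge(item, shared, Integer::sum);
            }
        }
        List<Long> ranked = new ArrayList<>(scores.keySet());
        // Highest score first; ties broken by item id so the order is deterministic.
        ranked.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Long.compare(a, b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(howMany, ranked.size())));
    }

    public static void main(String[] args) {
        InMemoryItemRecommender r = new InMemoryItemRecommender();
        long[][] interactions = {
            {1, 10}, {1, 11},
            {2, 10}, {2, 11}, {2, 12},
            {3, 10}, {3, 12}, {3, 13},
            {4, 11}, {4, 12}
        };
        for (long[] pair : interactions) r.addInteraction(pair[0], pair[1]);
        System.out.println(r.recommend(1, 3)); // prints [12, 13]
    }
}
```

For real use, Mahout's own single machine classes are the better choice, since they also handle similarity weighting, caching and data loading; the sketch only shows why a data set of this size does not need a cluster.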


On 06/06/2014 02:31 PM, Warunika Ranaweera wrote:
> I agree with your suggestion though. I have already implemented a Java
> recommender and it performed better. But, due to scalability problems that
> are predicted to occur in the future, we thought of moving to Mahout.
> However, it seems like, for now, it's better to go with the single machine
> implementation.
>
> Thanks for your suggestions,
> Warunika
>
>
>
> On Fri, Jun 6, 2014 at 3:36 PM, Sebastian Schelter <ssc@apache.org> wrote:
>
>> 1M ratings take up something like 20 megabytes. This is a datasize where
>> it does not make any sense to use Hadoop. Just try the single machine
>> implementation.
>>
>> --sebastian
>>
>>
>>
>>
>> On 06/06/2014 12:01 PM, Warunika Ranaweera wrote:
>>
>>> Hi Sebastian,
>>>
>>> Thanks for your prompt response. It's just a sample data set from our
>>> database and it may expand up to 6 million ratings. Since the performance
>>> was low for a smaller data set, I thought it would be even worse for a
>>> larger data set. As per your suggestion, I also applied the same command on
>>> 1 million user ratings for approx. 6000 users and got the same performance
>>> level.
>>>
>>> What is the average running time for the Mahout distributed recommendation
>>> job on 1 million ratings? Does it usually take more than 1 minute?
>>>
>>> Thanks in advance,
>>> Warunika
>>>
>>>
>>> On Fri, Jun 6, 2014 at 2:42 PM, Sebastian Schelter <ssc@apache.org> wrote:
>>>
>>>> You should not use Hadoop for such a tiny dataset. Use the
>>>> GenericItemBasedRecommender on a single machine in Java.
>>>>
>>>> --sebastian
>>>>
>>>>
>>>> On 06/06/2014 11:10 AM, Warunika Ranaweera wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am using Mahout's recommenditembased algorithm on a data set with nearly
>>>>> 10,000 (implicit) user ratings. This is the command I used:
>>>>>
>>>>> mahout recommenditembased --input ratings.csv --output recommendation
>>>>>   --usersFile users.dat --tempDir temp --similarityClassname
>>>>>   SIMILARITY_LOGLIKELIHOOD --numRecommendations 3
>>>>>
>>>>>
>>>>> Although the output is successfully generated, this process takes nearly 7
>>>>> minutes to produce recommendations for a single user. The Hadoop cluster
>>>>> has 8 nodes and the machine on which Mahout is invoked is an AWS EC2
>>>>> c3.2xlarge server. When I tracked the mapreduce jobs, I noticed that *no*
>>>>> more than one machine is utilized at a time, and the *recommenditembased*
>>>>> command takes 9 mapreduce jobs altogether with approx. 45 seconds taken per
>>>>> job.
>>>>>
>>>>> Since the performance is too slow for real time recommendations, it would
>>>>> be really helpful to know whether I'm missing any additional commands
>>>>> or configurations that enable faster performance.
>>>>>
>>>>> Thanks,
>>>>> Warunika
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

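The figures quoted in this thread can be checked with a little arithmetic. The sketch below assumes roughly 20 bytes per rating (one short CSV line of user ID, item ID and value), which is what "1M ratings take up something like 20 megabytes" implies; the class and method names are hypothetical. It suggests that the 9 MapReduce jobs at about 45 seconds each account for nearly all of the reported 7 minutes, which would explain why going from 10,000 to 1 million ratings gave the same running time.

```java
// Hypothetical helper for back-of-the-envelope estimates; not part of Mahout.
class ThreadEstimates {
    // Total wall-clock time when jobs run one after another.
    static long totalJobSeconds(int jobs, int secondsPerJob) {
        return (long) jobs * secondsPerJob;
    }

    // Data size in decimal megabytes; bytesPerRating is an assumed average.
    static double ratingsMegabytes(long ratings, int bytesPerRating) {
        return ratings * (double) bytesPerRating / 1_000_000.0;
    }

    public static void main(String[] args) {
        long seconds = totalJobSeconds(9, 45);
        System.out.println(seconds + " s = " + (seconds / 60.0) + " min"); // prints 405 s = 6.75 min
        System.out.println(ratingsMegabytes(1_000_000L, 20) + " MB");      // prints 20.0 MB
        System.out.println(ratingsMegabytes(6_000_000L, 20) + " MB");      // prints 120.0 MB
    }
}
```

Under the same 20-byte assumption, even the projected 6 million ratings come to about 120 MB, which still fits comfortably in memory on one machine.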
