mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: PFP Growth
Date Sat, 18 Sep 2010 19:08:50 GMT
Good advice for Mahout as well.  Trying it on a smaller sample will
tell you whether the problem is bad scaling or a genuine hang.
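
A minimal sketch of the sampling step, assuming the input has one transaction per line. The function name, the 50% fraction, and the sample transactions are hypothetical and only for illustration; nothing here is part of Mahout.

```python
import random

def sample_transactions(lines, fraction, seed=42):
    """Keep each transaction line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

# Hypothetical in-memory transactions: one per line, items separated by tabs.
transactions = ["milk\tbread", "milk\teggs", "bread\teggs\tmilk", "beer"]
small = sample_transactions(transactions, fraction=0.5)
print(len(small), "of", len(transactions), "transactions kept")
```

Running the job on samples of increasing size and timing each run should show whether the runtime grows smoothly or blows up at some point.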

On Sat, Sep 18, 2010 at 12:03 PM, Mark <static.void.dev@gmail.com> wrote:

>  Thanks. I'll give this a try and see how it performs.
>
>
> On 9/18/10 12:01 PM, Neal Richter wrote:
>
>> I suggest you take a sample of your data and run it through these
>> non-Hadoop implementations of itemset miners; FPGrowth is one of the
>> available algorithms.
>>
>> http://www.borgelt.net/fpm.html
>>
>> If you have success on a small sample, then start scaling the sample up
>> and investigate the distribution of your data.
>>
>> - Neal
>>
>> On Sat, Sep 18, 2010 at 12:30 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>
>>> In order to encourage your excellent practice of reposting, I will repeat
>>> my (non)-answer here.
>>>
>>> -------------------------------------------
>>> I don't know the answer to this, but previously this kind of problem was
>>> caused by highly skewed statistics in the input data.
>>>
>>> If there are things that cooccur with everything, then you will have
>>> problems with the speed of the algorithm.
>>>
>>> Can you say something about the distribution of your data?  Can you post
>>> a frequency by rank table?
>>>
>>> On Sat, Sep 18, 2010 at 10:37 AM, Mark <static.void.dev@gmail.com> wrote:
>>>
>>>> I am trying to run FPGrowth:
>>>>
>>>> hadoop jar /opt/mahout-0.3/mahout-examples-0.3.job
>>>> org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver -i
>>>> output/product/part-r-00000 -o pfp -method mapreduce -regex [\\t] -s 5
>>>> -g 17500 -k 50
>>>>
>>>> However, the 3rd task, "Processing FPTree: Bottom Up FP Growth > reduce",
>>>> will not finish. It's basically stuck at 85% and hasn't budged in over
>>>> an hour. The first task reported about 37K features, so I set -g to
>>>> 17500. Does anyone know what's going on and how I can speed this up?
>>>>
>>>> Thanks
>>>>
>>>>

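A minimal sketch of the frequency by rank table requested in the thread, assuming one transaction per line with tab-separated items (as the `-regex [\\t]` option in the command suggests). The function name and the sample transactions are hypothetical.

```python
from collections import Counter

def frequency_by_rank(transactions, sep="\t"):
    """Return (rank, item, count) rows, most frequent item first.

    Each item is counted at most once per transaction.
    """
    counts = Counter()
    for line in transactions:
        # A set removes repeats of the same item within one transaction.
        items = {item for item in line.rstrip("\n").split(sep) if item}
        counts.update(items)
    # Sort by descending count; break ties by item name for a stable table.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, item, count)
            for rank, (item, count) in enumerate(ordered, start=1)]

# Hypothetical transactions for illustration only.
rows = frequency_by_rank(["a\tb", "a\tc", "a\tb\tc", "a"])
for rank, item, count in rows:
    print(rank, item, count, sep="\t")
```

If the top few ranks have counts close to the total number of transactions, those items co-occur with nearly everything, which is the kind of skew described above as a cause of slow runs.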