mahout-user mailing list archives

From Robin Anil <>
Subject Re: PFPGrowth on cluster does not distribute work load equally on nodes
Date Thu, 17 Jun 2010 14:01:35 GMT
Hi Bjorn, the data is distributed in a skewed manner. That's a problem
with the algorithm as proposed in the paper. The way around it is to
increase the number-of-groups parameter. For example, if you have 10K unique
features, try to split them into groups such that there are around 10 features
per group. Each reducer finds the TopK patterns by creating FP-Trees having
predominantly those 10 features. So set the number of groups to 1000.
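
The arithmetic behind that suggestion can be sketched as follows. This is only an illustration, not Mahout code: the function names `num_groups` and `group_of` are hypothetical, and the assignment of frequency-ranked features to consecutive groups is an assumption made for the example.

```python
import math


def num_groups(unique_features: int, features_per_group: int) -> int:
    """Smallest number of groups such that no group holds more than
    features_per_group features."""
    return math.ceil(unique_features / features_per_group)


def group_of(feature_rank: int, features_per_group: int) -> int:
    """Hypothetical assignment: features are ranked by frequency
    (0 = most frequent) and consecutive ranks share a group."""
    return feature_rank // features_per_group


# 10K unique features at roughly 10 features per group
print(num_groups(10_000, 10))

# With only 10 groups (one per node), each group would cover 1000 features,
# so the group holding the most frequent features gets most of the work.
print(math.ceil(10_000 / 10))
```

With many small groups, the heavy, frequent features are spread over more reduce keys, which gives the scheduler a better chance of balancing them across the nodes.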


2010/6/16 "Björn Jacobs" <>

> Hello everyone!
> I am trying to get used to the PFPGrowth in the Mahout packages. I am
> planning to adapt this code to be able to run a parallelized subgroup
> discovery. This is btw the aim of my bachelor thesis, which I am currently
> writing.
> I'm having the problem that the algorithm does not distribute the work load
> equally on the nodes in my cluster. I have 10 nodes and I set the
> as well as the mapred.reduce.tasks variable.
> My problem is that the "PFP Growth Driver running over
> input/test002/sortedoutput"-Job did the following:
> Node 0 got nearly 100% of the work (finished in 20 minutes)
> Node 1-3 got a very small piece (finished in less than 10 seconds)
> Node 4-14 got nothing and finished execution immediately
> This way one node had to do all the work while the others had nothing to do
> and the job took really long to finish... that's not parallel.
> Is this a bug or do I have to configure something to get this working?
> Thanks a lot!
> Yours,
> Björn Jacobs
