Robin,
So here's how (P)FPGrowth looks from where I sit:
FPGrowth reports the support of itemsets individually. That is, if item X appears on its own
12 times and together with item Y 10 times (a total of 22 times), and item Y appears on its own
4 times (a total of 14 times), then this is what the output will be (say, for minsupport 2):
12 X
10 X Y
4 Y
If the minimum support is 5, the output will look like:
12 X
10 X Y
If the minimum support is 11, the output will look like:
12 X
If the minimum support is 13, there will be NO output,
even though X's support was 22 and Y's was 14 all along.
Even if we want to show just the maximal itemsets (although I would like to see ALL the frequent
itemsets, maximal or not), this output is wrong: with a minimum support of 13 we should still have
seen X (22) and Y (14).
Now say you add X Y Z 11 times.
For support 1 you'd see:
12 X
10 X Y
11 X Y Z
4 Y
And for support 11 you'd see:
12 X
11 X Y Z
Although I'd expect the output (for s=11) to be:
33 X
25 Y
21 XY
11 Z
11 XZ
11 YZ
11 XYZ
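
For reference, here is a minimal sketch of the counting I mean, in plain Java. It is not Mahout code: the class SupportCount and its methods are made up for illustration, and it brute-forces every subset of every basket, so it only makes sense for tiny examples like the one above. The support of an itemset is taken to be the number of baskets that contain all of its items.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of Mahout.
class SupportCount {

    // baskets maps each distinct basket to the number of times it occurs.
    // Returns, for every non-empty itemset, how many baskets contain it.
    static Map<Set<String>, Integer> supports(Map<List<String>, Integer> baskets) {
        Map<Set<String>, Integer> support = new HashMap<>();
        for (Map.Entry<List<String>, Integer> e : baskets.entrySet()) {
            List<String> items = e.getKey();
            int n = items.size();
            // Enumerate every non-empty subset of this basket via a bit mask.
            for (int mask = 1; mask < (1 << n); mask++) {
                Set<String> subset = new TreeSet<>();
                for (int i = 0; i < n; i++) {
                    if ((mask & (1 << i)) != 0) {
                        subset.add(items.get(i));
                    }
                }
                // Each occurrence of the basket counts once for each subset.
                support.merge(subset, e.getValue(), Integer::sum);
            }
        }
        return support;
    }

    // The four baskets from the example above, with their occurrence counts.
    static Map<List<String>, Integer> example() {
        Map<List<String>, Integer> baskets = new LinkedHashMap<>();
        baskets.put(Arrays.asList("X"), 12);
        baskets.put(Arrays.asList("X", "Y"), 10);
        baskets.put(Arrays.asList("Y"), 4);
        baskets.put(Arrays.asList("X", "Y", "Z"), 11);
        return baskets;
    }

    public static void main(String[] args) {
        int minSupport = 11;
        // Print "support itemset" for every itemset meeting the threshold.
        supports(example()).forEach((itemset, count) -> {
            if (count >= minSupport) {
                System.out.println(count + " " + String.join(" ", itemset));
            }
        });
    }
}
```

On the four baskets above this gives 33 for X, 25 for Y, 21 for X Y, and 11 for every itemset containing Z, which is the list I expected for s=11 (the order of the printed lines is not fixed).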
Hope this helps.
Vipul
On Mar 5, 2011, at 2:13 AM, Robin Anil wrote:
> Hi Vipul, is it possible for you to attach some test data to a JIRA issue for me to investigate?
>
> Robin
>
> On Sat, Mar 5, 2011 at 12:09 PM, Vipul Pandey <vipandey@gmail.com> wrote:
> Hi All,
>
>
> I'm running into a different issue with PFP growth now. I see an output like :
>
> $ cat part-r-00000 | grep 1678807047
> 12 1678807047
> 38 1678807047 3159925415
>
> which says that the support (12) for the item (1678807047) is less than the support
> (38) of a pair containing that item. Needless to say, this is ridiculous.
> I get this even with the Sequential version of FPGrowth.
>
> $ cat part-r-00000 | grep 1441690161
> 12 1441690161 3910019844
> 18 1604285941 1441690161 3910019844
> 75 1441690161
>
>
> I'm sure I'm doing something "crafty" somewhere.
>
> For sequential, I supply the file containing baskets and get the output as a sequence
> file.
>
> I run the following code to read the sequence file and write out the support and itemsets
> in plain text:
>
> (The MapReduce job was written for PFPGrowth output, which is bigger. My reducer is just an
> identity reducer.)
> @Override
> protected void map(Text key, TopKStringPatterns input, Context context)
>     throws IOException, InterruptedException {
>   // Each pattern is a pair of (itemset, support).
>   for (Pair<List<String>, Long> pair : input.getPatterns()) {
>     StringBuilder sb = new StringBuilder();
>     for (String item : pair.getFirst()) {
>       sb.append(item).append(' ');
>     }
>     // Emit the support as the key and the space-separated itemset as the value.
>     context.write(new LongWritable(pair.getSecond()), new Text(sb.toString()));
>   }
> }
>
> This gives me the output above.
> Is this the right way? Am I doing something wrong while parsing the output?
>
> My command line arguments are :
> -i ./baskets/part-r-00000 -o ./patterns -k 50 -method sequential -g 10 -regex '[\t]' -s 10
>
> Any help would be highly appreciated.
>
> Regards,
> Vipul
>
>
>
>
> On Feb 3, 2011, at 6:44 PM, <praveen.peddi@nokia.com> wrote:
>
> > Hi Vipul,
> > Frequent patterns are reported per feature, which is why you are seeing the pattern
> > twice. The first one is for feature 1518311 and the second one is for feature 1476937.
> >
> > However, both should have exactly the same support. I am not sure why you have different
> > supports for the same itemset. Maybe if you send the full output from Mahout as is, we
> > could take a look.
> >
> > Are you running on a multi-node Hadoop cluster? If so, did you read all the output
> > files?
> >
> > Praveen
> > ________________________________________
> > From: ext Vipul Pandey [vipandey@gmail.com]
> > Sent: Thursday, February 03, 2011 8:21 PM
> > To: user@mahout.apache.org
> > Subject: PFPGrowth - weird output?
> >
> > Hi all!
> >
> > I'm trying to run PFPgrowth on my data and this is an output I get. (Please
> > note that I parse the output in frequentpatterns folder and generate this
> > output with the support followed by the itemset)
> >
> > support : Itemset
> > *234 1518311 1476937 *
> > 235 55843184
> > 238 1238079
> > 244 34541
> > 247 4516454
> > 252 106478
> > 252 670864
> > *254 1476937 1518311 *
> >
> > You can see that two items are reported twice (*1518311 1476937*) with
> > different supports.
> >
> > And below are all the occurrences of these two items together. If you
> > look closely, it has all the permutations of the three items (*1476937* *720020*
> > *1518311*):
> >
> > 22 *1476937* 720020 *1518311*
> > 30 *1518311* *1476937* 720020
> > 30 720020 *1518311* *1476937*
> > 34 720020 *1476937* *1518311*
> > 38 *1518311* 720020 *1476937*
> > 42 *1476937* *1518311* 720020
> > 234 *1518311* *1476937*
> > 254 *1476937* *1518311*
> >
> > Does this mean that to get the support of just the pair (*1476937*
> > *1518311*) I will have to add all of them up!?
> >
> > Even in that case, the total comes out to *684*, and if I count the
> > number of co-occurrences of these two items in the original baskets the
> > support is *766*. Why is there a difference? Any idea?
> >
> >
> > Thanks!
> > Vipul
>
>

