mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Neumayer <neuma...@idi.ntnu.no>
Subject FPGrowth maybe not producing the right rules
Date Wed, 14 Apr 2010 22:33:29 GMT
Hi.

I was trying to test the fpgrowth algorithm on a toy example and I kind of 
fail
to get the correct results ... so it'd be nice if someone could take a look at
my (admittedly very ad-hoc) code and tell me whether there's something 
wrong with it. I'm running the mahout svn version.

I tried to run the following example (input):

1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5

With the following code:
  public static void main(String[] args) {
        int minSupport = 2;
        int maxHeapSize = 100;
       
        String input = "inputfile";
        String output = "output";

        FPGrowth<String> fp = new FPGrowth<String>();
        FileSystem fs = new RawLocalFileSystem();
        Configuration conf = new Configuration();

        String pattern = "[\\ ]";
        try {
            fs = FileSystem.get(URI.create(output), conf);

            SequenceFile.Writer writer = null;
            writer = new SequenceFile.Writer(fs, conf, new Path(output), 
Text.class, TopKStringPatterns.class);

            Charset encoding = Charset.forName("UTF-8");

            List<Pair<String, Long>> generateFList = null;
            generateFList = fp.generateFList(new StringRecordIterator(new 
FileLineIterable(new File(input), encoding, false),
                    pattern), minSupport);

            StringRecordIterator transactions = null;

            transactions = new StringRecordIterator(new FileLineIterable(new 
File(input), encoding, false), pattern);


            List<Text> keyList = new LinkedList<Text>();
            List<TopKStringPatterns> valueList = new 
LinkedList<TopKStringPatterns>();

            StringOutputCollector<Text, TopKStringPatterns> collector = new 
StringOutputCollector<Text, TopKStringPatterns>(
                    keyList, valueList);
            StringOutputConverter soc = new StringOutputConverter(collector);

            StatusUpdater updater = new StatusUpdater() {
                @Override
                public void update(String status) {
                    System.out.println(status);
                }
            };

            fp.generateTopKFrequentPatterns(transactions, generateFList, 
minSupport, maxHeapSize, null, soc, updater);
            writer.close();
            fs.close();

            System.out.println("list.size: " + valueList.size());
            HashSet<List<String>> unique = new HashSet<List<String>>();
            for (int i = 0; i < valueList.size(); i++) {

                System.out.println(keyList.get(i) + " / " + valueList.get(i));
                Iterator<Pair<List<String>, Long>> iterator = 
valueList.get(i).iterator();
                while (iterator.hasNext()) {

                    unique.add(iterator.next().getFirst());
                }
            }
            // print a bit more clearly
            Iterator<List<String>> iterator = unique.iterator();
            while (iterator.hasNext()) {
                System.out.println(iterator.next());
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }

I wrote another collector class (I know, that was probably not necessary but I 
wanted to use the API a bit):

public class StringOutputCollector<K extends Writable, V extends Writable> 
implements OutputCollector<K, V> {

    private final List<K> keyList;
    

    private final List<V> valueList;

    public StringOutputCollector(List<K> keyList, List<V> valueList) {
        this.keyList = keyList;
        this.valueList = valueList;
    }

    @Override
    public final void collect(K key, V value) throws IOException {
        keyList.add(key);
        valueList.add(value);
    }

}

The output I get is the following:
1 / ([1, 3],3), ([1, 2, 3, 5],2)
5 / ([2, 5],4), ([2, 3, 5],3), ([1, 2, 3, 5],2)
3 / ([3],4), ([2, 3, 5],3), ([1, 3],3), ([1, 2, 3, 5],2)
2 / ([2, 5],4), ([2, 3, 5],3), ([1, 2, 3, 5],2)

But I think there's more frequent itemsets to find in these transactions (with 
min support 2):

{1}	3
{2}	4
{3}	4
{5}	4
{1, 2}	2
{1, 3}	3
{1, 5}	2
{2, 3}	3
{2, 5}	4
{3, 5}	3
{1, 2, 3}	2
{1, 2, 5}	3
{1, 3, 5}	2
{2, 3, 5}	3
{1, 2, 3, 5}	2

Also, some of them are duplicates with the mahout code. I'm not sure whether 
it's supposed to be like that.

I also tried another textbook example and that didn't produce the same results 
either.

So am I doing something wrong with the topk selection (that's what I maybe 
could think of as a stupid mistake of mine)?

Any hints would be appreciated.

R

Mime
View raw message