mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: MinHash Clustering in Mahout
Date Tue, 25 Oct 2011 09:55:04 GMT

On Oct 19, 2011, at 11:38 AM, Varun Thacker wrote:

> I was trying to run the MinHash algorithm on the Reuters data set, so I did
> the following before running MinHashDriver:
> 
>   - Get the Reuters dataset
>   - Run org.apache.lucene.benchmark.utils.ExtractReuters to generate
>   reuters-out from reuters-sgm (the downloaded archive)
>   - Run seqdirectory to convert reuters-out to SequenceFile format
>   - Run seq2sparse to convert SequenceFiles to sparse vector format
> 
> I used these instructions from the K-means clustering wiki page.
> 
> This is the command I used to run MinHashDriver:
> 
> ./mahout org.apache.mahout.clustering.minhash.MinHashDriver --input
> /home/varun/mahout/sparse/tfidf-vectors/ -o /home/varun/mahout/minhash
> 
> The output file looks something like this:
> 
> 106460162-207863047 /reut2-015.sgm-653.txt
> 106460162-207863047 /reut2-021.sgm-7.txt
> 106460162-207863047 /reut2-013.sgm-307.txt
> 106460162-207863047 /reut2-013.sgm-306.txt
> 106460162-207863047 /reut2-014.sgm-786.txt
> 106460162-207863047 /reut2-013.sgm-304.txt
> 106460162-207863047 /reut2-013.sgm-303.txt
> 106460162-207863047 /reut2-021.sgm-230.txt
> 106460162-207863047 /reut2-012.sgm-548.txt
> 106460162-207863047 /reut2-020.sgm-161.txt
> 106460162-207863047 /reut2-021.sgm-553.txt
> 106460162-207863047 /reut2-013.sgm-299.txt
> 106460162-207863047 /reut2-015.sgm-284.txt
> 106460162-207863047 /reut2-013.sgm-996.txt
> 106460162-207863047 /reut2-021.sgm-441.txt
> 106460162-207863047 /reut2-013.sgm-298.txt
> 106460162-207863047 /reut2-013.sgm-995.txt
> 106460162-207863047 /reut2-015.sgm-521.txt
> 106460162-207863047 /reut2-020.sgm-162.txt
> 106460162-207863047 /reut2-020.sgm-163.txt
> 106460162-207863047 /reut2-013.sgm-296.txt
> ...
> ...
> 
> 
> Is this the correct way of running MinHash?
> 
> If so, I will update the wiki page
> https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering with
> the instructions.
> 
> Otherwise, could someone tell me what I am doing wrong?

I haven't looked into the code, but I get similar output, so I assume it is working.  It might
be good to incorporate this into build-reuters.sh, as well as to try it on some other input.
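
One quick way to sanity-check output like the listing quoted above is to group the lines by
cluster ID and look at how many documents land in each cluster. Below is a minimal sketch in
Python, assuming the output has been dumped to plain text with one whitespace-separated
"clusterId docName" pair per line. The function name group_by_cluster and the sample list are
illustrative only and are not part of Mahout.

```python
from collections import defaultdict


def group_by_cluster(lines):
    """Map each cluster ID to the list of document names assigned to it.

    Assumes each useful line looks like "<clusterId> <docName>"; blank,
    elided ("...") and otherwise malformed lines are skipped.
    """
    clusters = defaultdict(list)
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # blank line, "..." marker, or anything unexpected
        cluster_id, doc = parts
        clusters[cluster_id].append(doc)
    return dict(clusters)


# A few lines copied from the output quoted in this thread.
sample = [
    "106460162-207863047 /reut2-015.sgm-653.txt",
    "106460162-207863047 /reut2-021.sgm-7.txt",
    "106460162-207863047 /reut2-013.sgm-307.txt",
    "...",
]

clusters = group_by_cluster(sample)

# Print clusters from largest to smallest.
for cluster_id, docs in sorted(clusters.items(), key=lambda kv: -len(kv[1])):
    print(cluster_id, len(docs))
```

If nearly every document ends up under a single cluster ID, or every document gets its own,
that may be a hint that the input vectors or the hashing parameters deserve a closer look.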

-Grant