mahout-user mailing list archives

From Rajesh Nikam <rajeshni...@gmail.com>
Subject Re: bottom up clustering
Date Mon, 03 Jun 2013 08:30:44 GMT
Hi Suneel,

I have used seqdirectory followed by seq2sparse on the 20newsgroup set.

Then I used the following command to run streamingkmeans to get 40 clusters.

hadoop jar mahout-core-0.8-SNAPSHOT-job.jar \
  org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver \
  -i /user/hadoop/news-vectors/tf-vectors/ \
  -o /user/hadoop/news-stream-kmeans \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
  -k 40 \
  -km 190 \
  -testp 0.3 \
  -mi 10 \
  -ow
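
As a rough check on the -km value, the rule of thumb quoted further down this thread is km = k * log(n), rounded to the nearest integer. Below is a minimal Python sketch of that arithmetic. It assumes the natural logarithm (the thread does not state the base) and a hypothetical n of 20,000 documents; the function name estimate_km is made up for illustration and is not part of Mahout.

```python
import math


def estimate_km(k, n):
    """Return round(k * ln(n)) as a starting value for -km.

    k: number of final clusters (-k)
    n: number of data points to cluster
    The natural log is an assumption; the thread only says "log(n)".
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    return round(k * math.log(n))


# Hypothetical corpus size of 20,000 documents, 40 final clusters.
print(estimate_km(40, 20000))  # 396
```

With a different log base the estimate changes considerably, so treat the result as a starting point rather than a required value.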

I dumped the output using seqdumper from
/user/hadoop/news-stream-kmeans/part-r-00000.

In the dumped file, the centroids look like this:

Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
Key: 0: Value: key = 0, weight = 1.00, vector =
{1421:1.0,2581:1.0,5911:1.0,7854:3.0,7855:3.0,10022:2.0,11141:1.0,11188:1.0,11533:1.0,
Key: 1: Value: key = 1, weight = 3.00, vector =
{1297:1.0,1421:0.0,1499:1.0,2581:0.0,5899:1.0,5911:0.0,6322:2.0,6741:1.0,6869:1.0,7854
Key: 2: Value: key = 2, weight = 105.00, vector =
{794:0.09090909090909091,835:0.045454545454545456,1120:0.045454545454545456,1297:0.0
Key: 3: Value: key = 28, weight = 259.00, vector =
{1:0.030303030303030297,8:0.0101010101010101,12:0.0202020202020202,18:0.02020202020
--

 more --- >
--
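
For inspecting a dump like the one above, a small parser can pull out each centroid's key, weight, and vector entries. This is a sketch that assumes the text layout exactly as pasted in this message (a "Key: ... Value: ..." line followed by a line starting with "{"); the function name parse_centroids is hypothetical. An entry is only accepted when it is followed by a comma or closing brace, so a value cut off at the end of a line is skipped rather than guessed.

```python
import re

# Header line, e.g. "Key: 3: Value: key = 28, weight = 259.00, vector ="
HEADER = re.compile(r"Key: (\d+): Value: key = (\d+), weight = ([\d.]+), vector =")
# One "index:value" pair, accepted only if a comma or "}" follows it.
ENTRY = re.compile(r"(\d+):([0-9.]+(?:[eE][-+]?\d+)?)(?=[,}])")


def parse_centroids(lines):
    """Parse seqdumper-style text lines into a list of centroid dicts."""
    centroids = []
    for line in lines:
        header = HEADER.search(line)
        if header:
            centroids.append({
                "record": int(header.group(1)),  # SequenceFile key
                "key": int(header.group(2)),     # centroid's own key
                "weight": float(header.group(3)),
                "vector": {},
            })
        elif centroids and line.lstrip().startswith("{"):
            for index, value in ENTRY.findall(line):
                centroids[-1]["vector"][int(index)] = float(value)
    return centroids


sample = [
    "Key: 0: Value: key = 0, weight = 1.00, vector =",
    "{1421:1.0,2581:1.0,5911:1.0,",
    "Key: 3: Value: key = 28, weight = 259.00, vector =",
    "{1:0.030303030303030297,8:0.0101010101010101,12:0.02020",
]
for centroid in parse_centroids(sample):
    print(centroid["record"], centroid["key"], centroid["weight"], len(centroid["vector"]))
```

Summing the parsed weights is a quick way to see how many input points the centroids account for.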

I have tried using arff.vector to convert ARFF to vectors, but I don't know
how to convert the result to the tf-idf vector format expected by streaming kmeans.

Thanks
Rajesh



On Fri, May 31, 2013 at 7:23 PM, Rajesh Nikam <rajeshnikam@gmail.com> wrote:

> Hi Suneel,
>
> Thanks a lot for the detailed steps!
> I will try out the steps.
>
> Thanks, Ted for pointing this out!
>
> Thanks,
> Rajesh
>
>
> On Thu, May 30, 2013 at 9:50 PM, Suneel Marthi <suneel_marthi@yahoo.com> wrote:
>
>> To add to Ted's reply, streaming k-means was recently added to Mahout
>> (thanks to Dan and Ted).
>>
>> Here's the reference paper that talks about Streaming k-means:
>>
>> http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf
>>
>> You have to be working off of trunk to use this; it's not available as
>> part of any release yet.
>>
>> The steps for using Streaming k-means (I don't think it's been documented
>> yet)
>>
>> 1.  Generate Sparse vectors via seq2sparse (u have this already).
>>
>> 2.  mahout  streamingkmeans  -i <path to tfidf-vectors>  -o <output path>
>> --tempDir <temp folder path> -ow
>>  -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>>  -sc org.apache.mahout.math.neighborhood.FastProjectionSearch
>>  -k <No. of clusters> -km <see below for the math>
>>
>> -k = no of clusters
>> -km = (k * log(n))  where k = no. of clusters and n = no. of datapoints
>> to cluster,  round this to the nearest integer
>>
>> You have the option of using a FastProjectionSearch or ProjectionSearch or
>> LocalitySensitiveHashSearch for the -sc parameter.
>>
>>
>>
>>
>>
>>
>>
>> ________________________________
>>  From: Ted Dunning <ted.dunning@gmail.com>
>> To: "user@mahout.apache.org" <user@mahout.apache.org>
>> Cc: "user@mahout.apache.org" <user@mahout.apache.org>; Suneel Marthi <
>> suneel_marthi@yahoo.com>
>> Sent: Thursday, May 30, 2013 12:03 PM
>> Subject: Re: bottom up clustering
>>
>>
>> Rajesh
>>
>> The streaming k-means implementation is very much like what you are
>> asking for.  The first pass is to cluster into many, many clusters and then
>> cluster those clusters.
>>
>> Sent from my iPhone
>>
>> On May 30, 2013, at 11:20, Rajesh Nikam <rajeshnikam@gmail.com> wrote:
>>
>> > Hello Suneel,
>> >
>> > I got it. Next step to canopy is to feed these centroids to kmeans and
>> > cluster.
>> >
>> > However, what I want is to use centroids from these clusters and do
>> > clustering on them so as to find related clusters.
>> >
>> > Thanks
>> > Rajesh
>> >
>> >
>> > On Thu, May 30, 2013 at 8:38 PM, Suneel Marthi <suneel_marthi@yahoo.com>
>> > wrote:
>> >
>> >> The input to canopy is your vectors from seq2sparse and not cluster
>> >> centroids (as u had it), hence the error message u r seeing.
>> >>
>> >> The output of canopy could be fed into kmeans as input centroids.
>> >>
>> >>
>> >>
>> >>
>> >> ________________________________
>> >> From: Rajesh Nikam <rajeshnikam@gmail.com>
>> >> To: "user@mahout.apache.org" <user@mahout.apache.org>
>> >> Sent: Thursday, May 30, 2013 10:56 AM
>> >> Subject: bottom up clustering
>> >>
>> >>
>> >> Hi,
>> >>
>> >> I want to do bottom-up clustering (or rather, hierarchical clustering)
>> >> rather than top-down as mentioned in
>> >>
>> >> https://cwiki.apache.org/MAHOUT/top-down-clustering.html
>> >> kmeans->clusterdump->clusterpp and then kmeans on each cluster
>> >>
>> >> How can I use the centroids from the first phase of canopy for the next
>> >> level, of course with correct t1 and t2?
>> >>
>> >> I have tried using 'canopy', which gives centroids as output. How do I
>> >> apply one more level of clustering on these centroids?
>> >>
>> >> /user/hadoop/t/canopy-centroids/clusters-0-final is the output of the
>> >> first level of canopy.
>> >>
>> >> mahout canopy -i /user/hadoop/t/canopy-centroids/clusters-0-final -o
>> >> /user/hadoop/t/hclust -dm
>> >> org.apache.mahout.common.distance.TanimotoDistanceMeasure -t1 0.01 -t2 0.02
>> >> -ow
>> >>
>> >> It gave the following error:
>> >>
>> >>  13/05/30 20:21:38 INFO mapred.JobClient: Task Id :
>> >> attempt_201305231030_0519_m_000000_0, Status : FAILED
>> >> java.lang.ClassCastException:
>> >> org.apache.mahout.clustering.iterator.ClusterWritable cannot be cast to
>> >> org.apache.mahout.math.VectorWritable
>> >>
>> >> Thanks
>> >> Rajesh
>> >>
>>
>
>
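
Ted's description earlier in the thread (first cluster into many, many clusters, then cluster those clusters) can be illustrated with a toy two-pass sketch in plain Python. This is not Mahout's StreamingKMeans implementation: the fixed distance threshold, the Euclidean distance, the seeding by heaviest centroids, and the names sketch_pass and weighted_kmeans are all assumptions made for illustration.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sketch_pass(points, threshold):
    """Pass 1: fold each point into the nearest centroid if it lies within
    `threshold`, otherwise start a new centroid. Returns [vector, weight] pairs."""
    centroids = []
    for p in points:
        best = min(centroids, key=lambda c: dist(c[0], p), default=None)
        if best is not None and dist(best[0], p) <= threshold:
            w = best[1]
            # Weighted running mean keeps the centroid at the mean of its points.
            best[0] = [(x * w + y) / (w + 1) for x, y in zip(best[0], p)]
            best[1] = w + 1
        else:
            centroids.append([list(p), 1])
    return centroids


def weighted_kmeans(centroids, k, iterations=10):
    """Pass 2: Lloyd iterations over the weighted centroids from pass 1,
    seeded with the k heaviest centroids."""
    seeds = sorted(centroids, key=lambda c: -c[1])[:k]
    means = [list(c[0]) for c in seeds]
    for _ in range(iterations):
        sums = [[0.0] * len(means[0]) for _ in means]
        weights = [0.0] * len(means)
        for vec, w in centroids:
            j = min(range(len(means)), key=lambda i: dist(means[i], vec))
            weights[j] += w
            for d, x in enumerate(vec):
                sums[j][d] += w * x
        # Keep the old mean if a cluster received no centroids.
        means = [
            [s / weights[j] for s in sums[j]] if weights[j] else means[j]
            for j in range(len(means))
        ]
    return means


points = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0), (0.0, 1.0),
          (10.0, 10.0), (10.2, 10.0), (11.0, 10.0), (10.0, 11.0)]
sketch = sketch_pass(points, threshold=0.5)
final = weighted_kmeans(sketch, k=2)
print(len(sketch), final)
```

Because each first-pass centroid carries a weight, the second pass can work on the much smaller set of centroids while still accounting for every original point.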
