mahout-user mailing list archives

From Dan Filimon <dangeorge.fili...@gmail.com>
Subject Re: bottom up clustering
Date Tue, 04 Jun 2013 09:28:02 GMT
Hi Rajesh,

Streaming k-means clusters Vectors (that are in <*, VectorWritable>
sequence files) and outputs <IntWritable, CentroidWritable> sequence files.
A Centroid is the same as a Vector with the addition of an index and a
weight. You can call getVector() on a Centroid to get its Vector.
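
Here's a minimal sketch (mine, untested) of reading that output in Java.
The path is hypothetical, and I'm assuming CentroidWritable exposes its
Centroid via getCentroid():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable;
import org.apache.mahout.math.Centroid;
import org.apache.mahout.math.Vector;

public class ReadCentroids {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical path to the streaming k-means output.
    Path path = new Path("news-stream-kmeans/part-r-00000");
    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.get(conf), path, conf);
    IntWritable key = new IntWritable();
    CentroidWritable value = new CentroidWritable();
    while (reader.next(key, value)) {
      Centroid centroid = value.getCentroid();
      Vector vector = centroid.getVector();  // plain Vector, as noted above
      System.out.println("cluster " + key.get()
          + " weight=" + centroid.getWeight()
          + " nonZeros=" + vector.getNumNondefaultElements());
    }
    reader.close();
  }
}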




On Mon, Jun 3, 2013 at 2:49 PM, Suneel Marthi <suneel_marthi@yahoo.com> wrote:

> You should be able to feed arff.vectors to Streaming kmeans (I have not
> tried that myself, never had to work with arff).
> I had tfidf-vectors as an example; you should be good with arff.
>
> Give it a try and let us know.
>
> ________________________________
>  From: Rajesh Nikam <rajeshnikam@gmail.com>
> To: "user@mahout.apache.org" <user@mahout.apache.org>; Suneel Marthi <
> suneel_marthi@yahoo.com>
> Cc: Ted Dunning <ted.dunning@gmail.com>
> Sent: Monday, June 3, 2013 4:30 AM
> Subject: Re: bottom up clustering
>
>
>
> Hi Suneel,
>
>
> I have used seqdirectory followed by seq2sparse on the 20newsgroups set.
>
>
> Then I used the following command to run streamingkmeans to get 40 clusters.
>
>
> hadoop jar mahout-core-0.8-SNAPSHOT-job.jar \
>   org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver \
>   -i /user/hadoop/news-vectors/tf-vectors/ \
>   -o /user/hadoop/news-stream-kmeans \
>   -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
>   -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
>   -k 40 \
>   -km 190 \
>   -testp 0.3 \
>   -mi 10 \
>   -ow
>
>
> I dumped the output using seqdumper from
> /user/hadoop/news-stream-kmeans/part-r-00000.
>
>
> In the dumped file, the centroids look like:
>
> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
> Key: 0: Value: key = 0, weight = 1.00, vector =
> {1421:1.0,2581:1.0,5911:1.0,7854:3.0,7855:3.0,10022:2.0,11141:1.0,11188:1.0,11533:1.0,
> Key: 1: Value: key = 1, weight = 3.00, vector =
> {1297:1.0,1421:0.0,1499:1.0,2581:0.0,5899:1.0,5911:0.0,6322:2.0,6741:1.0,6869:1.0,7854
> Key: 2: Value: key = 2, weight = 105.00, vector =
> {794:0.09090909090909091,835:0.045454545454545456,1120:0.045454545454545456,1297:0.0
> Key: 3: Value: key = 28, weight = 259.00, vector =
> {1:0.030303030303030297,8:0.0101010101010101,12:0.0202020202020202,18:0.02020202020
>
> [... more clusters omitted ...]
>
>
> I have tried using arff.vector to convert arff to vectors, but I don't know
> how to convert them to the tf-idf vector format expected by streaming kmeans.
>
> Thanks
> Rajesh
>
>
>
>
> On Fri, May 31, 2013 at 7:23 PM, Rajesh Nikam <rajeshnikam@gmail.com>
> wrote:
>
> Hi Suneel,
> >
> >
> >Thanks a lot for the detailed steps!
> >
> >I will try out the steps.
> >
> >
> >Thanks, Ted, for pointing this out!
> >
> >
> >
> >Thanks,
> >Rajesh
> >
> >
> >
> >
> >On Thu, May 30, 2013 at 9:50 PM, Suneel Marthi <suneel_marthi@yahoo.com>
> wrote:
> >
> >To add to Ted's reply, streaming k-means was recently added to Mahout
> (thanks to Dan and Ted).
> >>
> >>Here's the reference paper that talks about Streaming k-means:
> >>
> >>http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf
> >>
> >>You have to be working off of trunk to use this; it's not available as
> part of any release yet.
> >>
> >>The steps for using Streaming k-means (I don't think it's been documented
> yet):
> >>
> >>1.  Generate Sparse vectors via seq2sparse (you have this already).
> >>
> >>2.  mahout streamingkmeans -i <path to tfidf-vectors> -o <output path>
> >>    --tempDir <temp folder path> -ow
> >>    -dm org.apache.mahout.common.distance.CosineDistanceMeasure
> >>    -sc org.apache.mahout.math.neighborhood.FastProjectionSearch
> >>    -k <No. of clusters> -km <see below for the math>
> >>
> >>-k = no. of clusters
> >>-km = (k * log(n)) where k = no. of clusters and n = no. of datapoints
> to cluster; round this to the nearest integer
> >>
> >>You have the option of using FastProjectionSearch, ProjectionSearch, or
> LocalitySensitiveHashSearch for the -sc parameter.
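> >>
> >>As a quick sanity check of the -km math, a tiny sketch (mine, not from
> >>any Mahout doc; the log base isn't stated here, so natural log is
> >>assumed, and the values are hypothetical):
> >>
> >>public class KmMath {
> >>  public static void main(String[] args) {
> >>    int k = 40;     // no. of clusters
> >>    int n = 20000;  // no. of datapoints to cluster
> >>    // k * log(n), rounded to the nearest integer; prints 396.
> >>    System.out.println(Math.round(k * Math.log(n)));
> >>  }
> >>}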
> >>
> >>________________________________
> >> From: Ted Dunning <ted.dunning@gmail.com>
> >>To: "user@mahout.apache.org" <user@mahout.apache.org>
> >>Cc: "user@mahout.apache.org" <user@mahout.apache.org>; Suneel Marthi <
> suneel_marthi@yahoo.com>
> >>Sent: Thursday, May 30, 2013 12:03 PM
> >>Subject: Re: bottom up clustering
> >>
> >>
> >>
> >>Rajesh,
> >>
> >>The streaming k-means implementation is very much like what you are
> asking for.  The first pass is to cluster into many, many clusters and then
> cluster those clusters.
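> >>
> >>To make that concrete, here's a toy, self-contained Java illustration of
> >>the two-pass idea (mine, plain arrays, not Mahout's actual
> >>implementation): pass 1 over-clusters the points into many intermediate
> >>centroids, pass 2 clusters those weighted centroids down to the final k.
> >>
> >>import java.util.Random;
> >>
> >>public class TwoPassSketch {
> >>
> >>  // Index of the closest centroid by squared Euclidean distance.
> >>  static int nearest(double[] p, double[][] cs) {
> >>    int best = 0;
> >>    double bestDist = Double.MAX_VALUE;
> >>    for (int i = 0; i < cs.length; i++) {
> >>      double d = 0;
> >>      for (int j = 0; j < p.length; j++) {
> >>        double diff = p[j] - cs[i][j];
> >>        d += diff * diff;
> >>      }
> >>      if (d < bestDist) { bestDist = d; best = i; }
> >>    }
> >>    return best;
> >>  }
> >>
> >>  // Plain weighted Lloyd iterations; w == null means unit weights.
> >>  static double[][] kmeans(double[][] pts, double[] w, int k,
> >>                           int iters, Random rnd) {
> >>    double[][] c = new double[k][];
> >>    for (int i = 0; i < k; i++) c[i] = pts[rnd.nextInt(pts.length)].clone();
> >>    for (int it = 0; it < iters; it++) {
> >>      double[][] sum = new double[k][pts[0].length];
> >>      double[] wsum = new double[k];
> >>      for (int p = 0; p < pts.length; p++) {
> >>        int a = nearest(pts[p], c);
> >>        double wp = (w == null) ? 1.0 : w[p];
> >>        wsum[a] += wp;
> >>        for (int j = 0; j < pts[p].length; j++) sum[a][j] += wp * pts[p][j];
> >>      }
> >>      for (int i = 0; i < k; i++)
> >>        if (wsum[i] > 0)
> >>          for (int j = 0; j < sum[i].length; j++) c[i][j] = sum[i][j] / wsum[i];
> >>    }
> >>    return c;
> >>  }
> >>
> >>  public static void main(String[] args) {
> >>    Random rnd = new Random(42);
> >>    double[][] pts = new double[2000][2];
> >>    for (double[] p : pts) { p[0] = rnd.nextGaussian(); p[1] = rnd.nextGaussian(); }
> >>
> >>    // Pass 1: many, many clusters (here 100).
> >>    double[][] sketch = kmeans(pts, null, 100, 5, rnd);
> >>
> >>    // Weight each intermediate centroid by how many points landed on it.
> >>    double[] weights = new double[sketch.length];
> >>    for (double[] p : pts) weights[nearest(p, sketch)]++;
> >>
> >>    // Pass 2: cluster the clusters down to the final k = 5.
> >>    double[][] finalCentroids = kmeans(sketch, weights, 5, 10, rnd);
> >>    System.out.println(finalCentroids.length + " final centroids");
> >>  }
> >>}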
> >>
> >>Sent from my iPhone
> >>
> >>On May 30, 2013, at 11:20, Rajesh Nikam <rajeshnikam@gmail.com> wrote:
> >>
> >>> Hello Suneel,
> >>>
> >>> I got it. The next step after canopy is to feed these centroids to
> >>> kmeans and cluster.
> >>>
> >>> However, what I want is to use the centroids from these clusters and
> >>> do clustering on them so as to find related clusters.
> >>>
> >>> Thanks
> >>> Rajesh
> >>>
> >>>
> >>> On Thu, May 30, 2013 at 8:38 PM, Suneel Marthi <
> suneel_marthi@yahoo.com> wrote:
> >>>
> >>>> The input to canopy is your vectors from seq2sparse and not cluster
> >>>> centroids (as you had it), hence the error message you are seeing.
> >>>>
> >>>> The output of canopy could be fed into kmeans as input centroids.
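> >>>>
> >>>> For instance, a hedged sketch (mine, untested): since the drivers are
> >>>> launched by class name, calling KMeansDriver's main() mirrors the
> >>>> mahout kmeans CLI. The output path is hypothetical; the -c flag
> >>>> points at canopy's output as the initial centroids.
> >>>>
> >>>> import org.apache.mahout.clustering.kmeans.KMeansDriver;
> >>>>
> >>>> public class CanopyThenKMeans {
> >>>>   public static void main(String[] args) throws Exception {
> >>>>     KMeansDriver.main(new String[] {
> >>>>         "-i", "/user/hadoop/news-vectors/tf-vectors",  // seq2sparse vectors
> >>>>         "-c", "/user/hadoop/t/canopy-centroids/clusters-0-final",
> >>>>         "-o", "/user/hadoop/t/kmeans-clusters",        // hypothetical output
> >>>>         "-dm", "org.apache.mahout.common.distance.TanimotoDistanceMeasure",
> >>>>         "-x", "10",  // max iterations
> >>>>         "-cl",       // also assign the input points to the final clusters
> >>>>         "-ow"
> >>>>     });
> >>>>   }
> >>>> }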
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> ________________________________
> >>>> From: Rajesh Nikam <rajeshnikam@gmail.com>
> >>>> To: "user@mahout.apache.org" <user@mahout.apache.org>
> >>>> Sent: Thursday, May 30, 2013 10:56 AM
> >>>> Subject: bottom up clustering
> >>>>
> >>>>
> >>>> Hi,
> >>>>
> >>>> I want to do bottom-up clustering (i.e., hierarchical clustering)
> >>>> rather than top-down as mentioned in
> >>>>
> >>>> https://cwiki.apache.org/MAHOUT/top-down-clustering.html
> >>>> kmeans->clusterdump->clusterpp and then kmeans on each cluster
> >>>>
> >>>> How can I take the centroids from the first phase of canopy and use
> >>>> them for the next level, of course with correct t1 and t2?
> >>>>
> >>>> I have tried using 'canopy', which gives centroids as output. How can
> >>>> I apply one more level of clustering on these centroids?
> >>>>
> >>>> /user/hadoop/t/canopy-centroids/clusters-0-final is the output of the
> >>>> first level of canopy.
> >>>>
> >>>> mahout canopy -i /user/hadoop/t/canopy-centroids/clusters-0-final \
> >>>>   -o /user/hadoop/t/hclust \
> >>>>   -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure \
> >>>>   -t1 0.01 -t2 0.02 -ow
> >>>>
> >>>> It gave following error:
> >>>>
> >>>> 13/05/30 20:21:38 INFO mapred.JobClient: Task Id :
> >>>> attempt_201305231030_0519_m_000000_0, Status : FAILED
> >>>> java.lang.ClassCastException:
> >>>> org.apache.mahout.clustering.iterator.ClusterWritable cannot be cast to
> >>>> org.apache.mahout.math.VectorWritable
> >>>>
> >>>> Thanks
> >>>> Rajesh
> >>>>
> >
>
