mahout-user mailing list archives

From Suneel Marthi <suneel_mar...@yahoo.com>
Subject Re: bottom up clustering
Date Mon, 03 Jun 2013 12:49:16 GMT
You should be able to feed the vectors from arff.vector to Streaming KMeans (I have not tried that myself; I've never had to work with arff).
I used tfidf-vectors as an example; you should be good with arff.
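
Something along these lines ought to work. This is only a rough, untested sketch: the paths are placeholders for your own, and I am going from memory on the arff.vector flag names, so double-check them against 'mahout arff.vector --help'.

mahout arff.vector -d /path/to/data.arff -o /user/hadoop/arff-vectors -t /user/hadoop/arff-dictionary

mahout streamingkmeans -i /user/hadoop/arff-vectors -o /user/hadoop/arff-stream-kmeans \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
  -k <no. of clusters> -km <k * log(n), rounded> -ow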

Give it a try and let us know.




________________________________
 From: Rajesh Nikam <rajeshnikam@gmail.com>
To: "user@mahout.apache.org" <user@mahout.apache.org>; Suneel Marthi <suneel_marthi@yahoo.com>

Cc: Ted Dunning <ted.dunning@gmail.com> 
Sent: Monday, June 3, 2013 4:30 AM
Subject: Re: bottom up clustering
 


Hi Suneel,


I have used seqdirectory followed by seq2sparse on the 20newsgroups set.


Then I used the following command to run streamingkmeans to get 40 clusters.


hadoop jar mahout-core-0.8-SNAPSHOT-job.jar org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver \
  -i /user/hadoop/news-vectors/tf-vectors/ \
  -o /user/hadoop/news-stream-kmeans \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
  -k 40 \
  -km 190 \
  -testp 0.3 \
  -mi 10 \
  -ow


I dumped the output using seqdumper from /user/hadoop/news-stream-kmeans/part-r-00000.


In the dumped file I see the centroids dumped like this:

Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
Key: 0: Value: key = 0, weight = 1.00, vector = {1421:1.0,2581:1.0,5911:1.0,7854:3.0,7855:3.0,10022:2.0,11141:1.0,11188:1.0,11533:1.0,
Key: 1: Value: key = 1, weight = 3.00, vector = {1297:1.0,1421:0.0,1499:1.0,2581:0.0,5899:1.0,5911:0.0,6322:2.0,6741:1.0,6869:1.0,7854
Key: 2: Value: key = 2, weight = 105.00, vector = {794:0.09090909090909091,835:0.045454545454545456,1120:0.045454545454545456,1297:0.0
Key: 3: Value: key = 28, weight = 259.00, vector = {1:0.030303030303030297,8:0.0101010101010101,12:0.0202020202020202,18:0.02020202020

--
 more --- >
--

I have tried using arff.vector to convert arff to vectors, but I don't know how to convert them into the tf-idf vector format expected by streaming kmeans.

Thanks
Rajesh




On Fri, May 31, 2013 at 7:23 PM, Rajesh Nikam <rajeshnikam@gmail.com> wrote:

Hi Suneel,
>
>
>Thanks a lot for detailed steps !
>
>I will try out the steps.
>
>
>Thanks, Ted for pointing this out!
>
>
>
>Thanks,
>Rajesh
>
>
>
>
>On Thu, May 30, 2013 at 9:50 PM, Suneel Marthi <suneel_marthi@yahoo.com> wrote:
>
>>To add to Ted's reply, streaming k-means was recently added to Mahout (thanks to Dan and Ted).
>>
>>Here's the reference paper that talks about Streaming k-means:
>>
>>http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf
>>
>>You have to be working off of trunk to use this; it's not available as part of any release yet.
>>
>>The steps for using Streaming k-means (I don't think it's been documented yet):
>>
>>1.  Generate sparse vectors via seq2sparse (you have this already).
>>
>>2.  mahout streamingkmeans -i <path to tfidf-vectors> -o <output path> --tempDir <temp folder path> -ow \
>>     -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
>>     -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
>>     -k <No. of clusters> -km <see below for the math>
>>
>>-k = no. of clusters
>>-km = k * log(n), where k = no. of clusters and n = no. of data points to cluster; round this to the nearest integer.
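>>
>>A rough worked example of that formula (my own back-of-the-envelope numbers): the 20 newsgroups corpus is on the order of n = 19,000 documents, so with k = 40 clusters you get k * log(n) ≈ 40 * ln(19000) ≈ 40 * 9.85 ≈ 394 (or ≈ 40 * 4.28 ≈ 171 if you read log as base 10). The base is not critical; -km is just a rough target for how many intermediate clusters the streaming pass keeps, so round whichever you compute to the nearest integer.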
>>
>>You have the option of using FastProjectionSearch, ProjectionSearch, or LocalitySensitiveHashSearch for the -sc parameter.
>>
>>
>>
>>
>>
>>
>>
>>
>>________________________________
>> From: Ted Dunning <ted.dunning@gmail.com>
>>To: "user@mahout.apache.org" <user@mahout.apache.org>
>>Cc: "user@mahout.apache.org" <user@mahout.apache.org>; Suneel Marthi <suneel_marthi@yahoo.com>
>>Sent: Thursday, May 30, 2013 12:03 PM
>>Subject: Re: bottom up clustering
>>
>>
>>
>>Rajesh
>>
>>The streaming k-means implementation is very much like what you are asking for.
>>The first pass is to cluster into many, many clusters and then cluster those clusters.
>>
>>Sent from my iPhone
>>
>>On May 30, 2013, at 11:20, Rajesh Nikam <rajeshnikam@gmail.com> wrote:
>>
>>> Hello Suneel,
>>>
>>> I got it. The next step after canopy is to feed these centroids to kmeans and
>>> cluster.
>>>
>>> However, what I want is to use the centroids from these clusters and do clustering on
>>> them so as to find related clusters.
>>>
>>> Thanks
>>> Rajesh
>>>
>>>
>>> On Thu, May 30, 2013 at 8:38 PM, Suneel Marthi <suneel_marthi@yahoo.com>wrote:
>>>
>>>> The input to canopy is your vectors from seq2sparse and not cluster
>>>> centroids (as you had it), hence the error message you are seeing.
>>>>
>>>> The output of canopy could be fed into kmeans as input centroids.
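>>>>
>>>> For example, something along these lines (a rough, untested sketch; the tfidf-vectors path, convergence delta, and iteration count are placeholders you would set for your own data):
>>>>
>>>> mahout kmeans -i /user/hadoop/t/tfidf-vectors \
>>>>   -c /user/hadoop/t/canopy-centroids/clusters-0-final \
>>>>   -o /user/hadoop/t/kmeans-clusters \
>>>>   -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure \
>>>>   -cd 0.1 -x 10 -cl -ow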
>>>>
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Rajesh Nikam <rajeshnikam@gmail.com>
>>>> To: "user@mahout.apache.org" <user@mahout.apache.org>
>>>> Sent: Thursday, May 30, 2013 10:56 AM
>>>> Subject: bottom up clustering
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I want to do bottom-up clustering (or rather, hierarchical clustering) rather
>>>> than top-down as mentioned in
>>>>
>>>> https://cwiki.apache.org/MAHOUT/top-down-clustering.html
>>>> kmeans->clusterdump->clusterpp and then kmeans on each cluster
>>>>
>>>> How do I take the centroids from the first phase of canopy and use them for the next level,
>>>> of course with correct t1 and t2?
>>>>
>>>> I have tried using 'canopy', which gives centroids as output. How do I apply
>>>> one more level of clustering on these centroids?
>>>>
>>>> /user/hadoop/t/canopy-centroids/clusters-0-final is output of first level
>>>> of canopy.
>>>>
>>>> mahout canopy -i /user/hadoop/t/canopy-centroids/clusters-0-final -o
>>>> /user/hadoop/t/hclust -dm
>>>> org.apache.mahout.common.distance.TanimotoDistanceMeasure -t1 0.01 -t2 0.02
>>>> -ow
>>>>
>>>> It gave following error:
>>>>
>>>>  13/05/30 20:21:38 INFO mapred.JobClient: Task Id :
>>>> attempt_201305231030_0519_m_000000_0, Status : FAILED
>>>> java.lang.ClassCastException:
>>>> org.apache.mahout.clustering.iterator.ClusterWritable cannot be cast to
>>>> org.apache.mahout.math.VectorWritable
>>>>
>>>> Thanks
>>>> Rajesh
>>>> 
>