mahout-user mailing list archives

From Florian Laws <flor...@florianlaws.de>
Subject Re: bottom up clustering
Date Tue, 11 Jun 2013 20:09:26 GMT
Does the new Streaming k-means work with arbitrary distance measures,
including custom ones?

From skimming the paper I got the idea that it is restricted to
Euclidean distance, but your example uses Cosine.
Which is correct?

Best,

Florian


On Thu, May 30, 2013 at 6:20 PM, Suneel Marthi <suneel_marthi@yahoo.com> wrote:
> To add to Ted's reply, streaming k-means was recently added to Mahout (thanks to Dan and Ted).
>
> Here's the reference paper that talks about Streaming k-means:
>
> http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf
>
> You have to be working off of trunk to use this; it's not available as part of any release yet.
>
> The steps for using Streaming k-means (I don't think it's been documented yet):
>
> 1.  Generate sparse vectors via seq2sparse (you have this already).
>
> 2.  mahout streamingkmeans -i <path to tfidf-vectors> -o <output path> --tempDir <temp folder path> -ow
>  -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>  -sc org.apache.mahout.math.neighborhood.FastProjectionSearch
>  -k <No. of clusters> -km <see below for the math>
>
> -k = no. of clusters
> -km = (k * log(n)), where k = no. of clusters and n = no. of datapoints to cluster; round this to the nearest integer
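
The -km arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `estimate_km` is made up for this example, and the natural logarithm is an assumption, since the message does not say which base `log` refers to.

```python
import math


def estimate_km(k: int, n: int) -> int:
    """Return k * log(n), rounded to the nearest integer.

    k is the number of final clusters (-k) and n is the number of
    datapoints to cluster. The natural log is an assumption here;
    the original message does not state the base.
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must both be at least 1")
    return round(k * math.log(n))


# Example: 20 final clusters over one million datapoints.
print(estimate_km(20, 1_000_000))
```

The result would be passed as the -km value on the streamingkmeans command line.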
>
> You have the option of using FastProjectionSearch, ProjectionSearch, or LocalitySensitiveHashSearch for the -sc parameter.
>
> ________________________________
>  From: Ted Dunning <ted.dunning@gmail.com>
> To: "user@mahout.apache.org" <user@mahout.apache.org>
> Cc: "user@mahout.apache.org" <user@mahout.apache.org>; Suneel Marthi <suneel_marthi@yahoo.com>
> Sent: Thursday, May 30, 2013 12:03 PM
> Subject: Re: bottom up clustering
>
>
> Rajesh
>
> The streaming k-means implementation is very much like what you are asking for.  The first pass clusters the data into many, many clusters, and a second pass then clusters those clusters.
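
The two-pass idea described above can be illustrated with a small Python sketch. This is not Mahout's implementation: the first pass here uses a fixed distance threshold (the streaming k-means paper uses a probabilistic, adaptive cutoff), the second pass is plain weighted Lloyd k-means with Euclidean distance, and all function names are hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def first_pass(points, threshold):
    """Single pass over the data, producing many small weighted clusters.

    Each point is merged into the nearest existing centroid if it lies
    within `threshold`; otherwise it starts a new centroid.
    Simplification: a fixed threshold, not the paper's adaptive cutoff.
    """
    sketch = []  # entries are [centroid_vector, weight]
    for p in points:
        best = min(sketch, key=lambda c: dist(c[0], p), default=None)
        if best is not None and dist(best[0], p) <= threshold:
            w = best[1]
            best[0] = [(w * c + x) / (w + 1) for c, x in zip(best[0], p)]
            best[1] = w + 1
        else:
            sketch.append([list(p), 1])
    return sketch


def second_pass(sketch, k, iterations=10, seed=0):
    """Cluster the weighted sketch centroids into k final centers."""
    if len(sketch) < k:
        raise ValueError("sketch has fewer centroids than k")
    rng = random.Random(seed)
    centers = [list(c[0]) for c in rng.sample(sketch, k)]
    dim = len(centers[0])
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in range(k)]
        weights = [0.0] * k
        for vec, w in sketch:
            j = min(range(k), key=lambda i: dist(centers[i], vec))
            weights[j] += w
            for d, x in enumerate(vec):
                sums[j][d] += w * x
        for j in range(k):
            if weights[j] > 0:  # leave an empty center where it is
                centers[j] = [s / weights[j] for s in sums[j]]
    return centers


# Toy data: two well-separated 2-D blobs.
data_rng = random.Random(1)
data = [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(100)]
data += [(data_rng.gauss(10, 0.5), data_rng.gauss(10, 0.5)) for _ in range(100)]
sketch = first_pass(data, threshold=1.0)
print(len(sketch), "sketch centroids ->", second_pass(sketch, k=2))
```

The point of the design is that the second pass only ever sees the much smaller set of weighted centroids, not the original data.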
>
> On May 30, 2013, at 11:20, Rajesh Nikam <rajeshnikam@gmail.com> wrote:
>
>> Hello Suneel,
>>
>> I got it. The next step after canopy is to feed these centroids to kmeans and
>> cluster.
>>
>> However, what I want is to use the centroids from these clusters and cluster
>> them again, so as to find related clusters.
>>
>> Thanks
>> Rajesh
>>
>>
>> On Thu, May 30, 2013 at 8:38 PM, Suneel Marthi <suneel_marthi@yahoo.com> wrote:
>>
>>> The input to canopy is your vectors from seq2sparse and not cluster
>>> centroids (as you had it), hence the error message you are seeing.
>>>
>>> The output of canopy could be fed into kmeans as input centroids.
>>>
>>> ________________________________
>>> From: Rajesh Nikam <rajeshnikam@gmail.com>
>>> To: "user@mahout.apache.org" <user@mahout.apache.org>
>>> Sent: Thursday, May 30, 2013 10:56 AM
>>> Subject: bottom up clustering
>>>
>>>
>>> Hi,
>>>
>>> I want to do bottom-up clustering (that is, hierarchical clustering) rather
>>> than top-down as mentioned in
>>>
>>> https://cwiki.apache.org/MAHOUT/top-down-clustering.html
>>> kmeans->clusterdump->clusterpp and then kmeans on each cluster
>>>
>>> How can I take the centroids from the first phase of canopy and use them for
>>> the next level, of course with the correct t1 and t2?
>>>
>>> I have tried using 'canopy', which gives centroids as output. How can I apply
>>> one more level of clustering on these centroids?
>>>
>>> /user/hadoop/t/canopy-centroids/clusters-0-final is the output of the first
>>> level of canopy.
>>>
>>> mahout canopy -i /user/hadoop/t/canopy-centroids/clusters-0-final -o
>>> /user/hadoop/t/hclust -dm
>>> org.apache.mahout.common.distance.TanimotoDistanceMeasure -t1 0.01 -t2 0.02
>>> -ow
>>>
>>> It gave the following error:
>>>
>>>  13/05/30 20:21:38 INFO mapred.JobClient: Task Id :
>>> attempt_201305231030_0519_m_000000_0, Status : FAILED
>>> java.lang.ClassCastException:
>>> org.apache.mahout.clustering.iterator.ClusterWritable cannot be cast to
>>> org.apache.mahout.math.VectorWritable
>>>
>>> Thanks
>>> Rajesh
>>>
