mahout-user mailing list archives

From Lokendra Singh <lsingh....@gmail.com>
Subject Re: Difference in KMeans performance with Mahout-0.3 and Mahout-0.4
Date Mon, 17 Jan 2011 17:51:04 GMT
@Jeff: Every parameter (convergence threshold, number of clusters (4000),
input points and input clusters) is the same in both cases.
I did not generate the initial clusters randomly; instead I used the
first 'k' input points as the centroids of the initial 'k' clusters.
Hence, the initial clusters are the same in both cases.
Each DenseVector has cardinality 64 (all doubles).
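
For reference, the deterministic seeding described above can be sketched in plain Java. This is only an illustration, not Mahout code: the class name FirstKSeeding and the method firstK are hypothetical, and points are modelled as plain double[] arrays instead of DenseVector.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Mahout): picks the first k input points
// as initial centroids, so every run starts from the same clusters.
final class FirstKSeeding {

    /** Returns copies of the first k points to use as initial centroids. */
    static List<double[]> firstK(List<double[]> points, int k) {
        if (k < 0 || k > points.size()) {
            throw new IllegalArgumentException("k must be between 0 and the number of points");
        }
        List<double[]> centroids = new ArrayList<>(k);
        for (int i = 0; i < k; i++) {
            // Clone so that later centroid updates do not modify the input points.
            centroids.add(points.get(i).clone());
        }
        return centroids;
    }
}
```

Because no random sampling is involved, the same input order should give the same initial clusters in both versions.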

@Robin: I have been using the Euclidean distance measure in both cases.
I am not using the mahout command line, but calling the API directly
through the KMeansDriver.runJob() (mahout-0.3) and
KMeansDriver.run() (mahout-0.4) methods, so command-line default values
are not an issue.
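
To illustrate why the distance measure matters for the number of converged clusters, here is a small standalone Java sketch. It assumes that a cluster counts as converged when the distance between its old and new centroid is at most the convergence threshold; whether either Mahout version checks convergence exactly this way is an assumption, and the class ConvergenceCheck and its methods are hypothetical, not Mahout API.

```java
// Hypothetical illustration (not Mahout code): the same centroid shift
// compared against the same threshold under two distance measures.
final class ConvergenceCheck {

    /** Sum of squared coordinate differences between two vectors. */
    static double squaredEuclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same cardinality");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Ordinary Euclidean distance: square root of the squared distance. */
    static double euclidean(double[] a, double[] b) {
        return Math.sqrt(squaredEuclidean(a, b));
    }

    /** Assumed convergence rule: centroid shift no larger than the threshold. */
    static boolean converged(double shift, double threshold) {
        return shift <= threshold;
    }
}
```

With a threshold of 0.001, a centroid that moves from (0, 0) to (0.006, 0.008) has shifted by about 0.01 in Euclidean terms (not converged) but only about 0.0001 in squared Euclidean terms (converged). So the same threshold is not comparable across the two measures.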

I will try random initialization of the clusters and report the behavior again.


Regards
Lokendra


On Mon, Jan 17, 2011 at 11:09 PM, Jeff Eastman
<jdog@windwardsolutions.com> wrote:

> Good call Robin,
> IIRC the default distance measure did change from Euclidean to
> SquaredEuclidean. Try specifying the DM directly using the -dm option to
> force the same DistanceMeasure.
>
>
> On 1/17/11 10:09 AM, Robin Anil wrote:
>
>> Are the distance measure classes the same in both runs? There could be changes
>> in default values which are causing this. Do a --help to see the default
>> values for cmdline flags.
>>
>> Robin
>>
>> On Mon, Jan 17, 2011 at 10:25 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>
>>> 4000 clusters is a lot as well.
>>>
>>> Did the 0.3 solution have lots of clusters with single members?
>>>
>>> On Mon, Jan 17, 2011 at 8:46 AM, Jeff Eastman <jdog@windwardsolutions.com> wrote:
>>>
>>>> I can't think of any architectural changes which would cause the
>>>> convergence performance to change, but this is curious indeed. I see you
>>>> are using DenseVectors but you did not say what their cardinality is. You
>>>> also did not say how you generated the initial clusters (canopy or random
>>>> sample). Can you run the 0.4 k-means with the initial clusters from your 0.3
>>>> run? That would tend to isolate the change to either k-means itself or the
>>>> sampling algorithm in RandomSeedGenerator. A poor set of initial
>>>> clusters could greatly impact the convergence, so that is where I'd suggest
>>>> starting.
>>>>
>>>> Jeff
>>>>
>>>> On 1/17/11 9:04 AM, Lokendra Singh wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am running the KMeans clustering algorithm to cluster about 60K points
>>>>> (DenseVectors) into 4K clusters on my Hadoop cluster.
>>>>> I initialized the clusters with the initial 'k' points as centroids (4000) and
>>>>> kept the convergence threshold pretty low (0.001).
>>>>>
>>>>> I tried running it with Mahout-0.3 and 0.4 and found a huge difference
>>>>> in their performance.
>>>>> The rate of convergence was pretty high with mahout-0.3 (in the 1st iteration
>>>>> about 600 clusters (out of 4000) converged; by the 6th iteration almost 3500
>>>>> clusters (out of 4000) had converged),
>>>>> while with mahout-0.4 I observed just 10 clusters (out of 4000) converging
>>>>> even after 10 iterations.
>>>>>
>>>>> What architectural difference between the KMeans implementations of mahout-0.4
>>>>> and mahout-0.3 might be causing this difference in performance?
>>>>>
>>>>> Regards
>>>>> Lokendra
>>>>>
>>>>>
>>>>>
>
