mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Seeding k-means with canopy clustering / Filter canopies
Date Sat, 05 Jan 2013 19:27:45 GMT
Depending upon your data, 0.7 Canopy can be extremely sensitive to the 
value you specify for T2. Somewhere between the larger T2 value that 
yields 1 canopy and the smaller T2 value that yields "the wrong number 
of [i.e. too many] centroids" lies a value that will give you fewer 
centroids. You can use a binary search strategy to adjust T2 to achieve 
numbers of centroids that seem "reasonable", but if extremely small 
perturbations of T2 give you wildly different numbers of centroids then 
your sense of reasonableness may be in question.
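
A minimal sketch of that binary search, written in Python rather than against Mahout's Java API. It assumes a simplified in-memory canopy pass in which a point starts a new canopy only when it lies farther than T2 from every existing canopy center. `count_canopies` and `search_t2` are hypothetical helper names, not Mahout functions, and the search assumes the canopy count shrinks roughly monotonically as T2 grows.

```python
import math
import random


def count_canopies(points, t2):
    """Count canopies from one greedy pass over the points.

    Assumption: a point becomes a new canopy center only if it is farther
    than t2 from every center found so far. T1 is omitted because, in this
    simplified pass, only T2 decides whether a new canopy is started.
    """
    centers = []
    for p in points:
        if all(math.dist(p, c) > t2 for c in centers):
            centers.append(p)
    return len(centers)


def search_t2(points, target_k, lo, hi, max_iter=30):
    """Bisect T2 between lo (too many canopies) and hi (too few).

    Returns the (t2, canopy_count) pair whose count came closest to target_k.
    """
    best = None
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        n = count_canopies(points, mid)
        if best is None or abs(n - target_k) < abs(best[1] - target_k):
            best = (mid, n)
        if n == target_k:
            break
        if n > target_k:
            lo = mid  # too many canopies: raise T2
        else:
            hi = mid  # too few canopies: lower T2
    return best


if __name__ == "__main__":
    # Hypothetical toy data: three well-separated blobs in 2-D.
    rng = random.Random(0)
    blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    points = [(cx + rng.uniform(-0.5, 0.5), cy + rng.uniform(-0.5, 0.5))
              for cx, cy in blobs for _ in range(50)]
    rng.shuffle(points)
    t2, n = search_t2(points, target_k=3, lo=0.01, hi=100.0)
    print(f"T2={t2:.3f} gives {n} canopies")
```

If the count jumps past the target between two nearly identical T2 values, the loop simply returns the closest count it saw, which is the instability described above.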

You can also sample random centroids from your data set by specifying -k 
(the reasonable number of centroids you seek), and KMeans will produce 
that many clusters.
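
To illustrate the idea behind -k (this is a sketch, not Mahout's implementation): pick k input points at random as the initial centroids, then run ordinary k-means iterations from them. `sample_seeds` and `kmeans` are hypothetical names, and the plain Lloyd-style loop with Euclidean distance is an assumption made for the example.

```python
import math
import random


def sample_seeds(points, k, seed=None):
    """Pick k distinct input points to serve as initial centroids."""
    return random.Random(seed).sample(points, k)


def kmeans(points, seeds, iterations=10):
    """Plain Lloyd iterations starting from the given seed centroids."""
    centroids = [tuple(s) for s in seeds]
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its points; an empty cluster
        # keeps its previous centroid so the result always has k entries.
        centroids = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids
```

Because the seeds are sampled rather than derived from canopies, the number of resulting clusters is exactly the k you ask for.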



On 1/3/13 9:08 AM, Stefan Kreuzer wrote:
> But even with a small weight (not sure how to apply that) I still have 
> the wrong number of centroids, i.e. the wrong k?
> I imagined something like:
>
> 1. Do canopy clustering with clusterFilter param => retrieve a folder 
> with x canopy clusters and a folder with x+n canopy centroids, where x 
> represents a good value for k.
> 2. Remove centroids that do not correspond with any of the canopy 
> clusters.
> 3. Use this reduced set of canopy centroids as the seed for k-means.
>
> I don't know if step 2 is possible and, if it is, how it could be 
> achieved. Performance is rather a non-issue in my case.
>
> -----Original Message-----
> From: Ted Dunning <ted.dunning@gmail.com>
> To: user <user@mahout.apache.org>
> Sent: Thu, 3 Jan 2013 4:41 pm
> Subject: Re: Seeding k-means with canopy clustering / Filter canopies
>
>
> The knn stuff on github can run with 0.7.  You would have to pull a few
> classes back that have been moved to Mahout, but it shouldn't be hard to
> do since the names and paths are identical.
>
> I have no good answer for you about using canopy centroids.  The normal
> way of doing this is to put a very small or zero weight on the seed
> centroids. That means that they start things going but have very little
> or no influence later.
>
> On Thu, Jan 3, 2013 at 3:43 AM, Stefan Kreuzer
> <stefankreuzer70@aol.de> wrote:
>
>> I fear I have to stick to 0.7. So there is no solution to get rid of the
>> superfluous canopy centroids for the k-means seed?
>>
>>
>> -----Original Message-----
>> From: Ted Dunning <ted.dunning@gmail.com>
>> To: user <user@mahout.apache.org>
>> Sent: Thu, 3 Jan 2013 7:01 am
>> Subject: Re: Seeding k-means with canopy clustering / Filter canopies
>>
>>
>> Bitlets have come into Mahout so far, but the core is in
>> https://github.com/tdunning/knn still.
>>
>> The quick summary is that this code can cluster 10-dimensional data at
>> about 1 million points in 20 seconds on a single machine.  It also can
>> scale out horizontally using a single map-reduce pass maintaining about
>> the same speed.  Performance scales down essentially linearly with higher
>> dimensionality.
>>
>> It works by making a fast, single pass through the data to produce a
>> sketch of the data.  This sketch is clustered in memory using a high
>> quality ball k-means algorithm.
>>
>> The API is currently not compatible with the current clustering API. The
>> algorithms are being tested for quality by Dan Filimon, who is also doing
>> the scaling work.
>>
>> On Wed, Jan 2, 2013 at 6:00 PM, Stefan Kreuzer
>> <stefankreuzer70@aol.de> wrote:
>>
>>> Uhm no... where can I look? Sorry
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Ted Dunning <ted.dunning@gmail.com>
>>> To: user <user@mahout.apache.org>
>>> Sent: Thu, 3 Jan 2013 2:12 am
>>> Subject: Re: Seeding k-means with canopy clustering / Filter canopies
>>>
>>>
>>> Stefan,
>>>
>>> Have you looked at the k-means work that Dan Filimon and I are doing?
>>>
>>> On Wed, Jan 2, 2013 at 4:46 PM, Stefan Kreuzer
>>> <stefankreuzer70@aol.de> wrote:
>>>
>>> > I try to seed a k-means clustering with canopy clustering. Problem:
>>> > Depending on the choice for t1 and t2, canopy clustering gives me too
>>> > many canopies or just 1.
>>> > I thought I could solve this with the clusterFilter parameter, but no
>>> > luck. Although I can restrict the number of _canopy clusters_ with the
>>> > clusterFilter parameter leading to what would be a good value for k,
>>> > this parameter has no effect on the _canopy centroids_ that are
>>> > created, and these are the seed for k-means.
>>> > Is there a way to get a seed for k-means that reflects the value given
>>> > for the clusterFilter parameter in canopy clustering?