mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: kmeans not returning k clusters
Date Wed, 09 May 2012 19:24:40 GMT
Does this cluster reduction happen when you prime k-means with canopy? 
Can you first adjust T1==T2 to get about 200 canopies and feed that to 
k-means? How wide are your term vectors? Have you tried other distance 
measures?

If anybody else out there is experiencing similar problems, please chime in.

Jeff

On 5/9/12 1:07 PM, Pat Ferrel wrote:
> That's what I'm doing now. Random seeding is not really the best way to 
> do kmeans. However, my results are repeatable as far as I've gone. And 
> canopy wants to generate a much larger set of clusters over a wide 
> range of T1 and T2 for this data set, so the theory that the data does 
> not support 30 clusters seems unlikely, although they may be a fair 
> distance apart.
>
> Since I've tried several times with several random seeds, the "seeds 
> are too close" theory doesn't seem likely.
> Given that canopy wants to generate more clusters, the "doesn't support 
> k = 30" theory doesn't seem likely either.
>
> I'm not saying that there is a real problem here but when I noticed it 
> I had 16,000 documents and was asking for 200 clusters and got 38. If 
> there is some good reason for this it would be nice to find it and 
> report it to the user. The "good reason" might be very helpful in the 
> analysis. Or it could be a bug.
>
> At least it's out there in case others are seeing lost clusters.
>
> On 5/9/12 7:49 AM, Jeff Eastman wrote:
>> Paritosh is correct in his analysis. K-means can work itself into a 
>> situation where there are some empty clusters if the initial cluster 
>> centers are too closely spaced or if the data really doesn't support 
>> k clusters. This is because it assigns each vector to the most likely 
>> (closest) cluster. If two prior clusters are very close together this 
>> can cause one of them to become empty.
>>
>> Have you tried priming k-means with canopy instead of the random 
>> sampler?
>>
>> On 5/9/12 10:35 AM, Pat Ferrel wrote:
>>> I suspect you are right, Paritosh. I ran the random seed with kmeans 
>>> several times on the supplied data set and always got 28 rather than 
>>> 30 clusters. I don't care so much about the number but it might mean 
>>> that some clusters are thrown out and without looking you couldn't 
>>> tell if they were important ones or not. Just upping k to 32 doesn't 
>>> really work if you still get some thrown out.
>>>
>>> At least I think the issue is repeatable with this data.
>>>
>>> On 5/9/12 1:14 AM, Paritosh Ranjan wrote:
>>>> Printouts of Mahout vectors show only the non-zero elements.
>>>> So the centers are not empty; rather, they are zero vectors.
>>>>
>>>> Prima facie, I suspect that you are getting a lot of empty clusters. 
>>>> This might be occurring due to the combination of distance measure, 
>>>> convergence threshold and distances between vectors.
>>>> Can you try to analyze and change/play around with these parameters?
>>>>
>>>> I will try to look into how the Random Cluster Initialization 
>>>> works. I will log a Jira if I find an issue. However, I think 
>>>> that there will be no problem in the cluster initialization part.
>>>>
>>>> On 09-05-2012 03:21, Danfeng Li wrote:
>>>>> I got the same issue. What I found is that many of the initial 
>>>>> centers are empty, and the final number of clusters is decided by 
>>>>> the number of nonempty centers.
>>>>>
>>>>> Here are some examples from my case:
>>>>>
>>>>> ...
>>>>> CL-34358205{n=0 c=[] r=[]}
>>>>> CL-34358207{n=0 c=[] r=[]}
>>>>> CL-34358209{n=0 c=[] r=[]}
>>>>> CL-34358213{n=0 c=[0:1.000] r=[]}
>>>>> CL-34358215{n=0 c=[] r=[]}
>>>>> CL-34358216{n=0 c=[] r=[]}
>>>>> CL-34358217{n=0 c=[] r=[]}
>>>>> CL-34358220{n=0 c=[] r=[]}
>>>>> CL-34358221{n=0 c=[] r=[]}
>>>>> CL-34358222{n=0 c=[] r=[]}
>>>>> CL-34358223{n=0 c=[] r=[]}
>>>>> CL-34358224{n=0 c=[] r=[]}
>>>>> CL-34358227{n=0 c=[0:1.000] r=[]}
>>>>> CL-34358228{n=0 c=[] r=[]}
>>>>> CL-34358229{n=0 c=[] r=[]}
>>>>> ...
>>>>>
>>>>> Is it the case there is a bug in initialization?
>>>>>
>>>>> Thanks.
>>>>> Dan
>>>>>
>>>>> -----Original Message-----
>>>>> From: Pat Ferrel [mailto:pat@occamsmachete.com]
>>>>> Sent: Tuesday, May 08, 2012 9:13 AM
>>>>> To: user@mahout.apache.org
>>>>> Subject: Re: kmeans not returning k clusters
>>>>>
>>>>> Here is a sample data set. In this case I asked for 30 and got 28, 
>>>>> but in other cases the discrepancy has been greater, like asking 
>>>>> for 200 and getting 38, but that was for a much larger data set.
>>>>>
>>>>> Running on my Mac laptop in a single-node pseudo-cluster: Hadoop 
>>>>> 0.20.205, Mahout 0.6.
>>>>>
>>>>> command line:
>>>>>
>>>>> mahout kmeans \
>>>>>       -i b2/bixo-vectors/tfidf-vectors/ \
>>>>>       -c b2/bixo-kmeans-centroids \
>>>>>       -cl \
>>>>>       -o b2/bixo-kmeans-clusters \
>>>>>       -k 30 \
>>>>>       -ow \
>>>>>       -cd 0.01 \
>>>>>       -x 20 \
>>>>>       -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
>>>>>
>>>>> Find the data here:
>>>>> http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740
>>>>>
>>>>> BTW, when I run rowsimilarity asking for 20 similar docs I get a 
>>>>> max of 20 but sometimes many fewer. Shouldn't this always return 
>>>>> the requested number? I'll post this question again to get the 
>>>>> attention of the right person.
>>>>>
>>>>> On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
>>>>>> I looked at the 0.6 version's code but was not able to find any 
>>>>>> reason.
>>>>>> If possible, can you share the data you are trying to cluster along
>>>>>> with the execution parameters?
>>>>>>
>>>>>> You can also open a Jira for this and provide the info there.
>>>>>>
>>>>>> On 07-05-2012 19:45, Pat Ferrel wrote:
>>>>>>> 0.6
>>>>>>>
>>>>>>> I take it this is not expected behavior? I could be doing something
>>>>>>> stupid. I only look in the "final" directory. Looking in the others
>>>>>>> with clusterdump shows the same number of clusters and I assumed
>>>>>>> they were iterations.
>>>>>>>
>>>>>>> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>>>>>>>> Which version are you using: 0.6 or the current 0.7-snapshot?
>>>>>>>>
>>>>>>>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>>>>>>>> What would cause kmeans to not return k clusters? As I tweak
>>>>>>>>> parameters I get different numbers of clusters but it's usually
>>>>>>>>> less than the k I pass in. Since I am not using canopies at
>>>>>>>>> present I would expect k to always be honored but the quality of
>>>>>>>>> the clusters would depend on the convergence amount and number of
>>>>>>>>> iterations allowed. No?
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>
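
The effect Jeff describes in the quoted thread, where nearest-center assignment leaves one of two closely spaced centers with no members, can be sketched roughly as follows. This uses 1-D points and absolute distance purely for illustration. The names `assign` and `kmeans_step` are hypothetical and are not Mahout APIs, and dropping empty clusters is an assumption chosen to match what the thread reports, not a confirmed description of Mahout 0.6.

```python
def assign(points, centers):
    """Group points by the index of their closest center (ties go to the lower index)."""
    members = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        members[nearest].append(p)
    return members


def kmeans_step(points, centers):
    """One k-means iteration: assign points, then recompute each center as a mean.

    Assumption for illustration: clusters that received no points are dropped,
    so fewer centers may come out than went in.
    """
    members = assign(points, centers)
    return [sum(m) / len(m) for m in members if m]


# Two of the three initial centers (0.0 and 0.1) are almost on top of each other.
points = [-1.0, -0.5, 0.0, 10.0, 11.0]
centers = [0.0, 0.1, 10.0]

members = assign(points, centers)
new_centers = kmeans_step(points, centers)
print(members)       # the middle cluster receives no points
print(new_centers)   # only two centers survive the step
```

Three centers go in and two come out, because every point near 0 is closer to the center at 0.0 than to its neighbour at 0.1. Asking for a larger k does not help if the extra seeds also land next to existing ones.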


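Paritosh's point that an empty-looking center is really a zero vector can also be explored with a small sketch. Assuming the textbook Tanimoto distance, 1 - a.b / (|a|^2 + |b|^2 - a.b), a zero center sits at distance 1 from every non-zero vector, so with non-negative TF-IDF weights it can at best tie, and never beat, a center that shares a term with the document. `tanimoto_distance` below is a hypothetical helper written for this note, not Mahout's TanimotoDistanceMeasure, whose edge-case handling may differ.

```python
def tanimoto_distance(a, b):
    """Textbook Tanimoto distance between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denom == 0:
        # Both vectors are zero; treating them as identical is an assumption.
        return 0.0
    return 1.0 - dot / denom


zero_center = [0.0, 0.0, 0.0]   # prints as c=[] when only non-zero elements are shown
real_center = [1.0, 0.0, 1.0]
docs = [
    [1.0, 0.5, 0.0],
    [0.0, 2.0, 1.0],
    [3.0, 0.0, 1.0],
]

for doc in docs:
    d_zero = tanimoto_distance(zero_center, doc)
    d_real = tanimoto_distance(real_center, doc)
    print(doc, "zero center:", d_zero, "real center:", round(d_real, 3))
```

If this reading is right, a seed that starts out as a zero vector would rarely attract any documents under this distance measure, which would be consistent with the n=0 clusters in Danfeng's dump. That is a hypothesis to check against the data, not a confirmed diagnosis.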