mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: mahout clusterdump output
Date Tue, 11 Sep 2012 14:28:07 GMT
Mattie is correct on the VL/CL notation. Convergence, however, does not 
mean that the cluster centers have stopped moving, only that their 
movement is below a certain threshold. Thus, it is entirely possible for 
a few points observed to be in cluster X in the final iteration to be 
classified into cluster Y in the final clustering output, since the 
centers of X and Y were adjusted slightly - but by less than the threshold 
- after the final iteration ended. Certainly, decreasing the threshold 
will minimize this phenomenon and in the limit can prevent it. This will 
require more iterations, however, and you need to assess the 
cost-benefit of this course of action.
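
To make the effect concrete, here is a minimal sketch in plain Python. It is not Mahout code: the function names, the 1-D data, the starting centers and the deliberately loose threshold of 1.5 are all assumptions chosen so that the numbers are easy to check by hand. It compares the counts "observed" in the last iteration (the analogue of n=) with the counts obtained by classifying the points against the final centers (the analogue of the -cl step).

```python
def assign(points, centers):
    """Return, for each point, the index of its nearest center (1-D distance)."""
    return [min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            for p in points]


def counts(labels, k):
    """Count how many points carry each cluster label."""
    out = [0] * k
    for label in labels:
        out[label] += 1
    return out


def kmeans(points, centers, threshold, max_iter=100):
    """Iterate until no center moves by more than `threshold`.

    Returns (final_centers, observed_counts). The observed counts come from
    the last iteration, i.e. they were taken with the centers as they stood
    BEFORE that iteration's update.
    """
    centers = list(centers)
    observed = [0] * len(centers)
    for _ in range(max_iter):
        labels = assign(points, centers)
        observed = counts(labels, len(centers))
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            # An empty cluster keeps its previous center.
            new_centers.append(sum(members) / len(members) if members else old)
        movement = max(abs(n - o) for n, o in zip(new_centers, centers))
        centers = new_centers
        if movement <= threshold:
            break
    return centers, observed


# Hypothetical toy data: 4.25 sits close to the boundary between the clusters.
points = [0.0, 1.0, 2.0, 4.25, 9.0, 10.75]
start = [1.0, 7.0]

for threshold in (1.5, 0.0):
    centers, observed = kmeans(points, start, threshold)
    classified = counts(assign(points, centers), len(centers))
    print(threshold, observed, classified)
# prints:
# 1.5 [3, 3] [4, 2]
# 0.0 [4, 2] [4, 2]
```

With the loose threshold the run stops after one iteration, the second center moves from 7.0 to 8.0, and the point 4.25 changes cluster in the final classification, so the observed and classified counts disagree. With a threshold of 0.0 the run continues until the centers stop moving and the two counts agree, at the cost of two more iterations.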

On 9/10/12 2:03 PM, Whitmore, Mattie wrote:
> VL- means you have converged, which is good.  CL- means I have clusters which have not
> converged -- i.e., I need to run more iterations, or adjust my threshold.
>
> I don't use the command-line kmeans; rather, I use the kmeansDriver API.  I have runClustering
> set to true.  Is this counting discrepancy just due to the fact that some of my clusters have
> not converged -- so even though the points are observed by a cluster, they are not assigned to
> that cluster?
>
> -----Original Message-----
> From: Yuji NISHIDA@U-Tokyo [mailto:nishidyatutokyo@gmail.com]
> Sent: Monday, September 10, 2012 12:46 PM
> To: user@mahout.apache.org
> Subject: Re: mahout clusterdump output
>
> Thank you for your kind explanation.
> I added the -cl option when running kmeans, so it no longer seems to be a problem.
>
> But I also want to make sure that my clusterDump result shows "VL-", not "CL-".
> Do you think this is correct output?
>
> Best regards.
>
> 2012/9/11 Jeff Eastman <jdog@windwardsolutions.com>:
>> I think the discrepancy between the number (n=) of vectors reported by the
>> cluster and the number of points actually clustered by the -cl option is
>> normal.
>>
>> In the final iteration, points are assigned to (observed by) (classified as)
>> each cluster based upon the distance measure and the cluster center computed
>> from the previous iteration. The (n=) value records the number of points
>> "observed by" the cluster in that iteration.
>> After the final iteration, a new cluster center is calculated for each
>> cluster. This moves the center by some amount, less than the convergence
>> threshold, but it moves.
>> During the subsequent classification (-cl) step, these new centers are used
>> to classify the points for output. This will inevitably cause some points to
>> be assigned to (observed by) (classified as) a different cluster and so the
>> output clusteredPoints will reflect this final assignment.
>>
>> In small, contrived examples, the clustering will likely be more stable
>> between the final iteration and the output of clustered points.
>>
>>
>>
>> On 9/10/12 9:06 AM, Whitmore, Mattie wrote:
>>
>> Hi,
>>
>> I too am having this problem.  I have a very small dimension space (3), and
>> a lot of vectors (hundreds of millions).  Therefore I can't print all to
>> disk (I receive an OOM error).  However, I can print 30 sample points
>> easily, and doing so showed results similar to yours (I "named" my vectors to
>> be the number of vectors clusterDumper printed in the cluster):
>>
>> VL-50{n=0 c=[...] r=[]}
>>          Weight : [props - optional]:  Point:
>>          1.0:    1 = [...]
>>          1.0:    2 = [...]
>> 		...
>>          1.0:   10 = [...]
>>
>> --> note also radius is blank, whereas the points do have spread in all
>> dimensions, this happened ONLY with converged clusters.
>>
>> CL-51{n=4 c=[...] r=[...]}
>>          Weight : [props - optional]:  Point:
>>          1.0:    1 = [...]
>>          1.0:    2 = [...]
>> 		...
>>          1.0:    6 = [...]
>>
>> As far as I understand the algorithm, problems which arise due to
>> dimensionality are convergence problems.  Basically, distance between points
>> is "longer" as dimension increases (volume increases dramatically as
>> dimension increases).
>>
>> This shouldn't affect clusterDumper, as clusterDumper simply reports on
>> sequence files from a completed job.  This is why the discrepancy is not
>> making a lot of sense to me.  Having more vectors within each cluster makes
>> sense -- when I sum the printed n values, I get a number orders of magnitude
>> smaller than the number of vectors I clustered.
>>
>> I used Mahout v0.7, Hadoop 0.20.2-cdh3u3
>>
>>
>> -----Original Message-----
>> From: Yuji NISHIDA@U-Tokyo [mailto:nishidyatutokyo@gmail.com]
>> Sent: Sunday, September 09, 2012 4:46 AM
>> To: user@mahout.apache.org
>> Subject: Re: mahout clusterdump output
>>
>> Hi all
>>
>> I would still like to confirm that this is not a problem,
>> especially the n value; I just hope it is not problematic...
>>
>> I discussed this in my lab, and one of our members noted that the dimension of
>> the feature vectors and the number of vectors I used were very different.
>> I used 100-dimensional vectors and 600,000 of them.
>>
>> Do you think using a small number of dimensions together with a large number
>> of vectors may cause problems, and that we need to keep some relation between
>> the two (especially in their sizes)?
>> Or do you think 100 is too small for the dimension?
>>
>> I would very much appreciate it if someone could follow up on my question.
>>
>> Regards.
>>
>> 2012/8/4 Yuji NISHIDA@U-Tokyo <nishidyatutokyo@gmail.com>:
>>
>> Dear all
>>
>> I am using Mahout to run canopy and kmeans, and I have a problem
>> with the clusterdump output.
>> Each vector is named with a simple number incremented from 1.
>>
>> When I used 5,000 vectors, I got a correct output. It looks like:
>>
>> VL-0{n=64,c=[...], r[...]}
>>      1.0: 1= [...]
>>      1.0: 3= [...]
>>      1.0: 4= [...]
>>       ...
>>      1.0: 396= [...]    # The number of vectors is exactly same as n(64).
>> VL-1{n=5,c=[...], r[...]}
>>      1.0: 2= [...]
>>      1.0: 12= [...]
>>      ...
>>      1.0: 4221= [...]
>> VL-2{n=121,c=[...], r[...]}
>> ...
>>
>> The number of vectors listed under each VL is exactly the same as its n value.
>>
>> When I used 600,000 vectors, the output looks wrong:
>>
>> VL-0{n=14,c=[...], r[...]}
>>      1.0: 66636= [...]
>>      1.0: 122570= [...]
>>      ...
>>      1.0: 522794= [...]    # The number of vectors is 31.
>> VL-8{n=0,c=[...], r[...]}
>>      1.0: 393539= [...]
>>      1.0: 398877= [...]
>>      ...
>>      1.0: 513448= [...]    # The number of vectors is 5.
>> VL-16{n=2,c=[...], r[...]}
>> ...
>>
>> It looks as if VL-1 to VL-7 and VL-9 to VL-15 are not used, but I confirmed
>> that they exist in the output.
>> The VLs seem to be listed in the order 0,8,16,...,11552, 1,9,17,...,11553,
>> 2,10,18... and so on.
>>
>> Can I trust this result, or should I suspect that it is caused by a bug?
>>
>> Hadoop : 0.20.204
>> Mahout : rev. 1351561, 1366995, 1367871
>>
>> Best regards.
>>
>> --
>> nishidy@u-tokyo
>>
>>
>>
>
>

