Thank you for your kind explanation.
I added the cl option when running kmeans, so this no longer seems to be a problem.
But I would also like to confirm that it is fine for my clusterDump result to show "VL", not "CL".
Do you think this is the correct output?
Best regards.
2012/9/11 Jeff Eastman <jdog@windwardsolutions.com>:
> I think the discrepancy between the number (n=) of vectors reported by the
> cluster and the number of points actually clustered by the cl option is
> normal.
>
> In the final iteration, points are assigned to (observed by) (classified as)
> each cluster based upon the distance measure and the cluster center computed
> from the previous iteration. The (n=) value records the number of points
> "observed by" the cluster in that iteration.
> After the final iteration, a new cluster center is calculated for each
> cluster. This moves the center by some amount, less than the convergence
> threshold, but it moves.
> During the subsequent classification (cl) step, these new centers are used
> to classify the points for output. This will inevitably cause some points to
> be assigned to (observed by) (classified as) a different cluster and so the
> output clusteredPoints will reflect this final assignment.
>
> In small, contrived examples, the clustering will likely be more stable
> between the final iteration and the output of clustered points.
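The effect Jeff describes above can be reproduced with a minimal sketch in plain Python. This is not Mahout code: the function names, the 1-D toy points and the starting centers are all made up for illustration, and the center movement is deliberately exaggerated (there is no convergence check), so it only shows the mechanism, not realistic magnitudes.

```python
import math

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def final_iteration(points, centers):
    """Observe points with the OLD centers, then recompute the centers."""
    members = [[] for _ in centers]
    for p in points:
        members[nearest(p, centers)].append(p)
    n = [len(m) for m in members]  # plays the role of the reported (n=) value
    new_centers = [mean(m) if m else c for m, c in zip(members, centers)]
    return n, new_centers

def classify(points, centers):
    """Count points per cluster using the NEW centers (the cl step)."""
    counts = [0] * len(centers)
    for p in points:
        counts[nearest(p, centers)] += 1
    return counts

# Hypothetical toy data: five 1-D points and two starting centers.
points = [(0.0,), (1.0,), (2.0,), (3.5,), (10.0,)]
centers = [(0.0,), (6.0,)]

n, new_centers = final_iteration(points, centers)
classified = classify(points, new_centers)

print("n per cluster:         ", n)
print("classified per cluster:", classified)
```

In this toy case the point 3.5 is observed by the second cluster during the iteration but is classified into the first cluster once the centers have moved, so the n values and the classified counts disagree even though nothing is broken.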
>
>
>
> On 9/10/12 9:06 AM, Whitmore, Mattie wrote:
>
> Hi,
>
> I too am having this problem. I have a very small dimension space (3), and
> a lot of vectors (hundreds of millions). Therefore I can't print all to
> disk (I receive an OOM error). However, I can print 30 sample points
> easily, and doing so showed results similar to you (I "named" my vectors to
> be the number of vectors clusterDumper printed in the cluster):
>
> VL50{n=0 c=[...] r=[]}
> Weight : [props  optional]: Point:
> 1.0: 1 = [...]
> 1.0: 2 = [...]
> ...
> 1.0: 10 = [...]
>
> Note also that the radius is blank, whereas the points do have spread in all
> dimensions; this happened ONLY with converged clusters.
>
> CL51{n=4 c=[...] r=[...]}
> Weight : [props  optional]: Point:
> 1.0: 1 = [...]
> 1.0: 2 = [...]
> ...
> 1.0: 6 = [...]
>
> As far as I understand the algorithm, problems which arise due to
> dimensionality are convergence problems. Basically, distance between points
> is "longer" as dimension increases (volume increases dramatically as
> dimension increases).
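The remark above about distances getting "longer" with dimension can be checked with a small sketch. It assumes points drawn uniformly at random from the unit cube, which is an illustrative choice and not the data from this thread; the function name and sample size are hypothetical.

```python
import math
import random

def mean_pairwise_distance(dim, n_points=200, seed=0):
    """Average Euclidean distance over all pairs of random points in [0, 1]^dim."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    total = 0.0
    pairs = 0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            total += math.dist(pts[i], pts[j])
            pairs += 1
    return total / pairs

d3 = mean_pairwise_distance(3)      # the 3-dimensional case mentioned above
d100 = mean_pairwise_distance(100)  # the 100-dimensional case from the earlier mail

print("mean distance, dim=3:  ", round(d3, 3))
print("mean distance, dim=100:", round(d100, 3))
```

The average distance grows roughly with the square root of the dimension, so a distance threshold tuned for 3 dimensions would not carry over to 100 dimensions unchanged.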
>
> This shouldn't affect clusterDumper, as clusterDumper simply reports on
> sequence files from a completed job. This is why the discrepancy is not
> making a lot of sense to me. Having more vectors within each cluster makes
> sense: when I sum the printed n values, I get a number orders of magnitude
> smaller than the number of vectors I clustered.
>
> I used Mahout v0.7, Hadoop 0.20.2cdh3u3
>
>
> -----Original Message-----
> From: Yuji NISHIDA@UTokyo [mailto:nishidyatutokyo@gmail.com]
> Sent: Sunday, September 09, 2012 4:46 AM
> To: user@mahout.apache.org
> Subject: Re: mahout clusterdump output
>
> Hi all
>
> I would still like to confirm that this is not a problem.
> I am especially concerned about the n value; I just hope it is not problematic...
>
> When I discussed this in my lab, one of our members noted that the dimension of
> the feature vectors and the number of vectors I used were very different.
> I used 600,000 vectors of 100 dimensions each.
>
> Do you think it may cause problems to combine a small number of dimensions with
> a large number of vectors? Do we need to make sure that there is some relation
> between the two (especially in their sizes)?
> Or do you think 100 is too small for the dimension?
>
> I would very much appreciate it if someone could follow up on my question.
>
> Regards.
>
> 2012/8/4 Yuji NISHIDA@UTokyo <nishidyatutokyo@gmail.com>:
>
> Dear all
>
> I am using Mahout to run canopy and kmeans, and I have a problem
> with the clusterdump output.
> Each vector is named with a simple number, incremented from 1.
>
> When I used 5,000 vectors, I got correct output. It looks like:
>
> VL0{n=64,c=[...], r[...]}
> 1.0: 1= [...]
> 1.0: 3= [...]
> 1.0: 4= [...]
> ...
> 1.0: 396= [...] # The number of vectors is exactly the same as n (64).
> VL1{n=5,c=[...], r[...]}
> 1.0: 2= [...]
> 1.0: 12= [...]
> ...
> 1.0: 4221= [...]
> VL2{n=121,c=[...], r[...]}
> ...
>
> The number of vectors in each VL is exactly the same as its n value.
>
> When I used 600,000 vectors, the output looks wrong:
>
> VL0{n=14,c=[...], r[...]}
> 1.0: 66636= [...]
> 1.0: 122570= [...]
> ...
> 1.0: 522794= [...] # The number of vectors is 31.
> VL8{n=0,c=[...], r[...]}
> 1.0: 393539= [...]
> 1.0: 398877= [...]
> ...
> 1.0: 513448= [...] # The number of vectors is 5.
> VL16{n=2,c=[...], r[...]}
> ...
>
> It looks as if VL1 to VL7 and VL9 to VL15 are not used, but I confirmed
> that they exist in the output.
> The VLs seem to appear in the order 0,8,16,...,11552, 1,9,17,...,11553,
> 2,10,18... and so on.
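One possible reading of the 0,8,16,..., 1,9,17,... order, offered purely as an assumption that nothing in this thread confirms: if the clusters were written to 8 output part files, with each integer cluster id going to part (id modulo 8), and the dump reads the part files one after another, the ids would come out in exactly this interleaved order. The sketch below only demonstrates that arithmetic; the function name and the count of 8 partitions are hypothetical, and it does not call or model any Mahout or Hadoop API.

```python
def concatenated_order(num_clusters, num_partitions):
    """Order of cluster ids when partitions (id % num_partitions) are read in sequence."""
    partitions = [[] for _ in range(num_partitions)]
    for cluster_id in range(num_clusters):
        # Assumption: each id lands in the partition given by id modulo the partition count.
        partitions[cluster_id % num_partitions].append(cluster_id)
    # Reading partition 0 fully, then partition 1, and so on.
    return [cid for part in partitions for cid in part]

# Small illustrative run: 24 cluster ids spread over 8 assumed partitions.
order = concatenated_order(num_clusters=24, num_partitions=8)
print(order)
```

If that assumption holds, the order would be only an artifact of how the output files are read, and every cluster id would still appear exactly once.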
>
> Can I trust this result, or should I suspect that it is caused by a bug?
>
> Hadoop : 0.20.204
> Mahout : rev. 1351561, 1366995, 1367871
>
> Best regards.
>
> 
> nishidy@utokyo
>
>
>

nishidy@utokyo
